Abstract

For the problem of high-dimensional sparse linear regression, it is known that an $\ell_0$-based
estimator can achieve a $1/n$ “fast” rate on the prediction error without any conditions on the
design matrix, whereas in the absence of restrictive conditions on the design matrix, popular
polynomial-time methods only guarantee the $1/\sqrt{n}$ “slow” rate. In this paper, we show that
the slow rate is intrinsic to a broad class of M-estimators. In particular, for estimators based on
minimizing a least-squares cost function together with a (possibly non-convex) coordinate-wise
separable regularizer, there is always a “bad” local optimum such that the associated prediction
error is lower bounded by a constant multiple of $1/\sqrt{n}$. For convex regularizers, this lower bound
applies to all global optima. The theory is applicable to many popular estimators, including
convex $\ell_1$-based methods as well as M-estimators based on nonconvex regularizers, including
the SCAD penalty or the MCP regularizer. In addition, we show that the bad local optima are
very common, in that a broad class of local minimization algorithms with random initialization
will typically converge to a bad solution.


1 Introduction 

The classical notion of minimax risk, which plays a central role in decision theory, allows for the 
statistician to implement any possible estimator, regardless of its computational cost. For many 
problems, there are a variety of estimators, which can be ordered in terms of their computational 
complexity. Given that it is usually feasible only to implement polynomial-time methods, it has 
become increasingly important to study computationally-constrained analogues of the minimax 
estimator, in which the choice of estimator is restricted to a subset of computationally efficient
estimators [32]. A fundamental question is when such computationally-constrained forms of minimax
risk estimation either coincide or differ in a fundamental way from their classical counterpart. 

The goal of this paper is to explore such gaps between classical and computationally practical 
minimax risks, in the context of prediction error for high-dimensional sparse regression. Our main 
contribution is to establish a fundamental gap between the classical minimax prediction risk and 
the best possible risk achievable by a broad class of M-estimators based on coordinate-separable
regularizers, one which includes various nonconvex regularizers that are used in practice. 

In more detail, the classical linear regression model is based on a response vector $y \in \mathbb{R}^n$ and a
design matrix $X \in \mathbb{R}^{n \times d}$ that are linked via the relationship

$$y = X\theta^* + w, \tag{1}$$

where the vector $w \in \mathbb{R}^n$ is a random noise vector. Our goal is to estimate the unknown regression
vector $\theta^* \in \mathbb{R}^d$. Throughout this paper, we focus on the standard Gaussian model, in which the
entries of the noise vector $w$ are i.i.d. $N(0, \sigma^2)$ variates, and the case of deterministic design, in
which the matrix $X$ is viewed as non-random. In the sparse variant of this model, the regression
vector is assumed to have a small number of non-zero coefficients. In particular, for some positive
integer $k < d$, the vector $\theta^*$ is said to be $k$-sparse if it has at most $k$ non-zero coefficients. Thus, the
model is parameterized by the triple $(n, d, k)$ of sample size $n$, ambient dimension $d$, and sparsity
$k$. We use $\mathbb{B}_0(k)$ to denote the $\ell_0$-“ball” of all $d$-dimensional vectors with at most $k$ non-zero
entries.
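As an illustration of the model (1), the following minimal sketch draws one data set with a $k$-sparse regression vector. The function name `simulate_sparse_regression`, the Gaussian draw used to produce a design (which is then held fixed), and the unit magnitude of the non-zero coefficients are assumptions made here for illustration only; they are not part of the paper's set-up.

```python
import numpy as np

def simulate_sparse_regression(n, d, k, sigma, seed=0):
    """Draw (X, y, theta_star) from y = X theta* + w with w ~ N(0, sigma^2 I_n)."""
    rng = np.random.default_rng(seed)
    # The design is drawn once and afterwards treated as fixed (non-random).
    X = rng.standard_normal((n, d))
    # k-sparse regression vector: exactly k non-zero coordinates here.
    theta_star = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    theta_star[support] = 1.0
    # i.i.d. Gaussian noise with standard deviation sigma.
    w = sigma * rng.standard_normal(n)
    y = X @ theta_star + w
    return X, y, theta_star

X, y, theta_star = simulate_sparse_regression(n=50, d=200, k=5, sigma=1.0)
```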

An estimator $\hat\theta$ is a measurable function of the pair $(y, X)$, taking values in $\mathbb{R}^d$, and its quality
can be assessed in different ways. In this paper, we focus on its fixed design prediction error, given
by $\mathbb{E}\big[\frac{1}{n}\|X(\hat\theta - \theta^*)\|_2^2\big]$, a quantity that measures how well $\hat\theta$ can be used to predict the vector $X\theta^*$
of noiseless responses. The worst-case prediction error of an estimator $\hat\theta$ over the set $\mathbb{B}_0(k)$ is given
by

$$\mathcal{M}_{n,k,d}(\hat\theta; X) := \sup_{\theta^* \in \mathbb{B}_0(k)} \frac{1}{n}\mathbb{E}\big[\|X(\hat\theta - \theta^*)\|_2^2\big]. \tag{2}$$

Given that $\theta^*$ is $k$-sparse, the most direct approach would be to seek a $k$-sparse minimizer to the
least-squares cost $\|y - X\theta\|_2^2$, thereby obtaining the $\ell_0$-based estimator

$$\hat\theta_{\ell_0} \in \arg\min_{\theta \in \mathbb{B}_0(k)} \|y - X\theta\|_2^2. \tag{3}$$


The $\ell_0$-based estimator $\hat\theta_{\ell_0}$ is known [7, 28] to satisfy a bound of the form

$$\mathcal{M}_{n,k,d}(\hat\theta_{\ell_0}; X) \precsim \frac{\sigma^2 k \log d}{n}, \tag{4}$$

where $\precsim$ denotes an inequality up to constant factors (independent of the triple $(n, d, k)$ as well as
the standard deviation $\sigma$). However, it is not tractable to compute this estimator in a brute force
manner, since there are $\binom{d}{k}$ subsets of size $k$ to consider.
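The brute-force computation mentioned above can be sketched as follows: for every candidate support of size $k$, solve the least-squares problem restricted to those columns and keep the best fit. This is a minimal illustration, not an implementation from the paper; the function name `l0_estimator` and the small noiseless test instance are assumptions introduced here.

```python
from itertools import combinations
from math import comb
import numpy as np

def l0_estimator(X, y, k):
    """Exhaustive search for a k-sparse least-squares minimizer, as in (3)."""
    n, d = X.shape
    best_cost, best_theta = np.inf, np.zeros(d)
    for support in combinations(range(d), k):
        cols = list(support)
        # Least squares restricted to the candidate support.
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        cost = np.sum((y - X[:, cols] @ coef) ** 2)
        if cost < best_cost:
            best_cost = cost
            best_theta = np.zeros(d)
            best_theta[cols] = coef
    return best_theta

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
theta_star = np.zeros(8)
theta_star[[1, 5]] = [2.0, -3.0]
y = X @ theta_star  # noiseless responses, used only as a sanity check
theta_hat = l0_estimator(X, y, k=2)
num_subsets = comb(8, 2)  # number of supports visited; grows very quickly with d and k
```

The loop visits every one of the `comb(d, k)` supports, which is what makes this approach impractical beyond small problems.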

The computational intractability of the $\ell_0$-based estimator has motivated the use of various
heuristic algorithms and approximations, including the basis pursuit method [10], the Dantzig
selector [8], as well as the extended family of Lasso estimators [30, 10, 36, 2]. Essentially, these
methods are based on replacing the $\ell_0$-constraint with its $\ell_1$-equivalent, in either a constrained
or penalized form. There is now a very large body of work on the performance of such methods,
covering different criteria including support recovery, $\ell_2$-norm error and prediction error (e.g., see
the book [6] and references therein).

For the case of fixed design prediction error that is the primary focus here, such $\ell_1$-based
estimators are known to achieve the bound (4) only if the design matrix $X$ satisfies certain conditions,
such as the restricted eigenvalue (RE) condition or compatibility condition [4, 31] or the stronger
restricted isometry property [8]; see the paper [31] for an overview of these various conditions,
and their inter-relationships. Without such conditions, the best known guarantees for $\ell_1$-based
estimators are of the form

$$\mathcal{M}_{n,k,d}(\hat\theta_{\ell_1}; X) \precsim \sigma R \sqrt{\frac{\log d}{n}}, \tag{5}$$

a bound that is valid without any RE conditions on the design matrix $X$ whenever the $k$-sparse
regression vector $\theta^*$ has $\ell_1$-norm bounded by $R$ (e.g., see the papers [7, 26, 28]).

The substantial gap between the “fast” rate (4) and the “slow” rate (5) leaves open a fundamental
question: is there a computationally efficient estimator attaining the bound (4) for general
design matrices? In the following subsections, we provide an overview of the currently known results 
on this gap, and we then provide a high-level statement of the main result of this paper. 


1.1 Lower bounds for Lasso 

Given the gap between the fast rate (4) and the Lasso’s slower rate (5), one possibility might be
that existing analyses of prediction error are overly conservative, and $\ell_1$-based methods can actually
achieve the bound (4), without additional constraints on $X$. Some past work has given negative
answers to this question. Foygel and Srebro [15] constructed a 2-sparse regression vector and a random
design matrix for which the Lasso prediction error with any choice of regularization parameter
$\lambda_n$ is lower bounded by $1/\sqrt{n}$. In particular, their proposed regression vector is $\theta^* = (0, \ldots, 0, \frac{1}{2}, \frac{1}{2})$.
In their design matrix, the columns are randomly generated with distinct covariances, and moreover,
such that the rightmost column is strongly correlated with the other two columns on its left.
With this particular regression vector and design matrix, they show that Lasso’s prediction error
is lower bounded by $1/\sqrt{n}$ for any choice of Lasso regularization parameter $\lambda$. This construction
is specific to the Lasso, and thus does not apply to more general M-estimators. Moreover, for this
particular counterexample, there is a one-to-one correspondence between the regression vector and
the design matrix, so that one can identify the non-zero coordinates of $\theta^*$ by examining the design
matrix. Consequently, for this construction, a simple reweighted form of the Lasso can be used to
achieve the fast rate. In particular, the reweighted Lasso estimator


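$$\hat\theta_{\mathrm{wl}} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n}\|y - X\theta\|_2^2 + \lambda \sum_{j=1}^d \alpha_j |\theta_j| \Big\}, \tag{6}$$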


with $\lambda$ chosen in the usual manner ($\lambda \asymp \sigma\sqrt{\frac{\log d}{n}}$), weights $\alpha_{d-1} = \alpha_d = 1$ and the remaining weights
$\{\alpha_1, \ldots, \alpha_{d-2}\}$ chosen to be sufficiently large, has this property. Dalalyan et al. [12] construct a
stronger counter-example, for which the prediction error of Lasso is again lower bounded by $1/\sqrt{n}$.
For this counterexample, there is no obvious correspondence between the regression vector and the
design matrix. Nevertheless, as we show in Appendix A, the reweighted Lasso estimator (6) with a
proper choice of the regularization coefficients still achieves the fast rate on this example. Another
related piece of work is by Candes and Plan [9]. They construct a design matrix for which the
Lasso estimator, when applied with the usual choice of regularization parameter $\lambda \asymp \sigma\big(\frac{\log d}{n}\big)^{1/2}$,
has sub-optimal prediction error. Their matrix construction is spiritually similar to ours, but the
theoretical analysis is limited to the Lasso for a particular choice of regularization parameter. It does
not rule out the possibility that other choices of regularization parameters, or other polynomial-time
estimators, can achieve the fast rate. In contrast, our hardness result applies to general
M-estimators based on coordinate-wise separable regularizers, and it allows for arbitrary regularization
parameters.
1.2 Complexity-theoretic lower bound for polynomial-time sparse estimators

In our own recent work [35], we have provided a complexity-theoretic lower bound that applies
to a very broad class of polynomial-time estimators. The analysis is performed under a standard
complexity-theoretic condition (namely, that the class NP is not a subset of the class P/poly),
and shows that there is no polynomial-time algorithm that returns a $k$-sparse vector that achieves
the fast rate. The lower bound is established as a function of the restricted eigenvalue of the
design matrix. Given sufficiently large $(n, k, d)$ and any $\gamma > 0$, a design matrix $X$ with restricted
eigenvalue $\gamma$ can be constructed, such that every polynomial-time $k$-sparse estimator $\hat\theta_{\mathrm{poly}}$ has its
minimax prediction risk lower bounded as

$$\mathcal{M}_{n,k,d}(\hat\theta_{\mathrm{poly}}; X) \succsim \frac{\sigma^2 k^{1-\delta} \log d}{\gamma n}, \tag{7}$$

where $\delta > 0$ is an arbitrarily small positive scalar. Note that the fraction $k^{-\delta}/\gamma$, which characterizes
the gap between the fast rate and the rate (7), could be arbitrarily large. The lower bound has the
following consequence: any estimator that achieves the fast rate must either not be polynomial-time,
or must return a regression vector that is not $k$-sparse.

The condition that the estimator is $k$-sparse is essential in the proof of lower bound (7). In
particular, the proof relies on a reduction between estimators with low prediction error in the
sparse linear regression model, and methods that can solve the 3-set covering problem [24], a
classical problem that is known to be NP-hard. The 3-set covering problem takes as input a list
of 3-sets, which are subsets of a set $S$ whose cardinality is $3k$. The goal is to choose $k$ of these
subsets in order to cover the set $S$. The lower bound (7) is established by showing that if there
is a $k$-sparse estimator achieving better prediction error, then it provides a solution to the 3-set
covering problem, as every non-zero coordinate of the estimate corresponds to a chosen subset.
This hardness result does not eliminate the possibility of finding a polynomial-time estimator that
returns dense vectors satisfying the fast rate. In particular, it is possible that a dense estimator
cannot be used to recover a good solution to the 3-set covering problem, implying that it is not
possible to use the hardness of 3-set covering to assert the hardness of achieving low prediction
error in sparse regression.

At the same time, there is some evidence that better prediction error can be achieved by dense
estimators. For instance, suppose that we consider a sequence of high-dimensional sparse linear
regression problems, such that the restricted eigenvalue $\gamma = \gamma_n$ of the design matrix $X \in \mathbb{R}^{n \times d}$
decays to zero at the rate $\gamma_n = 1/n^2$. For such a sequence of problems, as $n$ diverges to infinity,
the lower bound (7), which applies to $k$-sparse estimators, goes to infinity, whereas the Lasso
upper bound (5) converges to zero. Although this behavior is somewhat mysterious, it is not a
contradiction. Indeed, what makes Lasso’s performance better than the lower bound (7) is that it
allows for non-sparse estimates. In this example, truncating the Lasso’s estimate to be $k$-sparse will
substantially hurt the prediction error. In this way, we see that proving lower bounds for non-sparse
estimators (the problem to be addressed in this paper) is a substantially more challenging task
than proving lower bounds for estimators that must return sparse outputs.
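To make this comparison concrete, the two bounds can be evaluated along the sequence $\gamma_n = 1/n^2$. The display below is a worked sketch under added simplifying assumptions, namely that $\sigma$, $R$, $k$ and $\delta$ are held fixed and that $\log d$ grows more slowly than $n$:

$$\underbrace{\frac{\sigma^2 k^{1-\delta} \log d}{\gamma_n\, n}}_{\text{lower bound (7)}} = \sigma^2 k^{1-\delta}\, n \log d \;\longrightarrow\; \infty, \qquad \underbrace{\sigma R \sqrt{\frac{\log d}{n}}}_{\text{upper bound (5)}} \;\longrightarrow\; 0 \qquad \text{as } n \to \infty.$$

The first expression uses $1/(\gamma_n n) = n^2/n = n$, so the bound for $k$-sparse estimators grows linearly in $n$ while the Lasso bound shrinks.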

1.3 Main results of this paper 

With this context in place, let us now turn to a high-level statement of the main results of this
paper. More precisely, our contribution is to provide additional evidence against the polynomial
achievability of the fast rate (4), in particular by showing that the slow rate (5) is a lower bound
for a broad class of M-estimators, namely those based on minimizing a least-squares cost function
together with a coordinate-wise decomposable regularizer. In particular, we consider estimators
that are based on an objective function of the form $L(\theta; \lambda) = \frac{1}{n}\|y - X\theta\|_2^2 + \lambda\rho(\theta)$, for a weighted
regularizer $\rho : \mathbb{R}^d \to \mathbb{R}$ that is coordinate-separable. See Section 2.1 for a precise definition of this
class of estimators. Our first main result (Theorem 1) establishes that there is always a matrix
$X \in \mathbb{R}^{n \times d}$ such that for any coordinate-wise separable function $\rho$ and for any choice of weight
$\lambda \ge 0$, the objective $L$ always has at least one local optimum $\hat\theta_\lambda$ such that


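$$\sup_{\theta^* \in \mathbb{B}_0(k)} \mathbb{E}\Big[\frac{1}{n}\|X(\hat\theta_\lambda - \theta^*)\|_2^2\Big] \succsim \frac{1}{\sqrt{n}}. \tag{8}$$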


Moreover, if the regularizer $\rho$ is convex, then this lower bound applies to all global optima of the
convex criterion $L$. This lower bound is applicable to many popular estimators, including the ridge
regression estimator [19], the basis pursuit method [10], the Lasso estimator [30], the weighted
Lasso estimator [36], the square-root Lasso estimator [2], and least squares based on nonconvex
regularizers such as the SCAD penalty [14] or the MCP penalty [34].

In the nonconvex setting, it is impossible (in general) to guarantee anything beyond local 
optimality for any solution found by a polynomial-time algorithm [17]. Nevertheless, to play the 
devil’s advocate, one might argue that the assumption that an adversary is allowed to pick a bad 
local optimum could be overly pessimistic for statistical problems. In order to address this concern, 
we prove a second result (Theorem 2) that demonstrates that bad local solutions are difficult 
to avoid. Focusing on a class of local descent methods, we show that given a random isotropic 
initialization centered at the origin, the resulting stationary points have poor mean-squared error— 
that is, they can only achieve the slow rate. In this way, this paper shows that the gap between the 
fast and slow rates in high-dimensional sparse regression cannot be closed via standard application 
of a very broad class of methods. In conjunction with our earlier complexity-theoretic paper [35], 
it adds further weight to the conjecture that there is a fundamental gap between the performance 
of polynomial-time and exponential-time methods for sparse prediction. 

The remainder of this paper is organized as follows. We begin in Section 2 with further background,
including a precise definition of the family of M-estimators considered in this paper, some
illustrative examples, and discussion of the prediction error bound achieved by the Lasso. Section 3 
is devoted to the statements of our main results, along with discussion of their consequences. In 
Section 4, we provide the proofs of our main results, with some technical lemmas deferred to the 
appendices. We conclude with a discussion in Section 5. 


2 Background and problem set-up 

As previously described, an instance of the sparse linear regression problem is based on observing
a pair $(X, y) \in \mathbb{R}^{n \times d} \times \mathbb{R}^n$ of instances that are linked via the linear model (1), where the unknown
regressor $\theta^*$ is assumed to be $k$-sparse, and so belongs to the $\ell_0$-ball $\mathbb{B}_0(k)$. Our goal is to find a
good predictor, meaning a vector $\hat\theta$ such that the mean-squared prediction error $\frac{1}{n}\|X(\hat\theta - \theta^*)\|_2^2$ is
small.
2.1 Least squares with coordinate-separable regularizers

The analysis of this paper applies to estimators that are based on minimizing a cost function of
the form

$$L(\theta; \lambda) = \frac{1}{n}\|y - X\theta\|_2^2 + \lambda\rho(\theta), \tag{9}$$

where $\rho : \mathbb{R}^d \to \mathbb{R}$ is a regularizer, and $\lambda \ge 0$ is a regularization weight. We consider the following
family $\mathcal{F}$ of coordinate-separable regularizers:

(i) The function $\rho : \mathbb{R}^d \to \mathbb{R}$ is coordinate-wise decomposable, meaning that $\rho(\theta) = \sum_{j=1}^d \rho_j(\theta_j)$
for some univariate functions $\rho_j : \mathbb{R} \to \mathbb{R}$.

(ii) Each univariate function satisfies $\rho_j(0) = 0$ and is symmetric around zero (i.e., $\rho_j(t) = \rho_j(-t)$
for all $t \in \mathbb{R}$).

(iii) On the nonnegative real line $[0, +\infty)$, each function $\rho_j$ is nondecreasing.

Let us consider some examples to illustrate this definition. 


Bridge regression: The family of bridge regression estimates [16] take the form

$$\hat\theta_{\mathrm{bridge}} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n}\|y - X\theta\|_2^2 + \lambda \sum_{j=1}^d |\theta_j|^\gamma \Big\}.$$

Note that this is a special case of the objective function (9) with $\rho_j(\cdot) = |\cdot|^\gamma$ for each coordinate.
When $\gamma \in \{1, 2\}$, it corresponds to the Lasso estimator and the ridge regression estimator respectively.
The analysis of this paper provides lower bounds for both estimators, uniformly over the
choice of $\lambda$.
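The objective (9) with a bridge penalty can be evaluated directly. The sketch below is illustrative only: the helper names `objective` and `bridge_penalty` are introduced here, and the same univariate penalty is applied to every coordinate.

```python
import numpy as np

def objective(theta, X, y, lam, rho_j):
    """Evaluate L(theta; lam) = (1/n) ||y - X theta||_2^2 + lam * sum_j rho_j(theta_j)."""
    n = X.shape[0]
    residual = y - X @ theta
    return residual @ residual / n + lam * np.sum(rho_j(theta))

def bridge_penalty(gamma):
    """Return the univariate bridge penalty t -> |t|^gamma, applied elementwise."""
    return lambda t: np.abs(t) ** gamma

X = np.eye(2)
y = np.array([1.0, 2.0])
theta = np.array([1.0, -1.0])
lasso_value = objective(theta, X, y, lam=0.5, rho_j=bridge_penalty(1))  # gamma = 1 (Lasso)
ridge_value = objective(theta, X, y, lam=0.5, rho_j=bridge_penalty(2))  # gamma = 2 (ridge)
```

In this toy instance the two penalties happen to coincide because both coordinates of `theta` have absolute value one.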

Weighted Lasso: The weighted Lasso estimator [36] uses a weighted $\ell_1$-norm to regularize the
empirical risk, and leads to the estimator

$$\hat\theta_{\mathrm{wl}} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n}\|y - X\theta\|_2^2 + \lambda \sum_{j=1}^d \alpha_j |\theta_j| \Big\}.$$

Here $\alpha_1, \ldots, \alpha_d$ are weights that can be adaptively chosen with respect to the design matrix $X$.
The weighted Lasso can perform better than the ordinary Lasso, which corresponds to the special
case in which all $\alpha_j$ are equal. For instance, on the counter-example proposed by Foygel and
Srebro [15], for which the ordinary Lasso estimator achieves only the slow $1/\sqrt{n}$ rate, the weighted
Lasso estimator achieves the $1/n$ convergence rate. Nonetheless, the analysis of this paper shows
that there are design matrices for which the weighted Lasso, even when the weights are chosen
adaptively with respect to the design, has prediction error at least a constant multiple of $1/\sqrt{n}$.
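One standard way to compute a weighted Lasso solution numerically is proximal gradient descent (ISTA) with coordinate-wise soft-thresholding. The paper does not prescribe a solver, so the following is a minimal sketch under that assumption; the function name `weighted_lasso_ista`, the iteration count and the toy orthogonal design are all introduced here for illustration.

```python
import numpy as np

def weighted_lasso_ista(X, y, lam, alpha, num_iters=500):
    """Proximal gradient (ISTA) for (1/n)||y - X theta||_2^2 + lam * sum_j alpha_j |theta_j|."""
    n, d = X.shape
    # Lipschitz constant of the gradient of the least-squares term.
    lipschitz = 2.0 * np.linalg.norm(X, ord=2) ** 2 / n
    step = 1.0 / lipschitz
    theta = np.zeros(d)
    for _ in range(num_iters):
        grad = 2.0 * X.T @ (X @ theta - y) / n
        z = theta - step * grad
        # Coordinate-wise soft-thresholding with threshold step * lam * alpha_j.
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam * alpha, 0.0)
    return theta

# Sanity check on an orthogonal design, where the minimizer has a closed form:
# theta_j = soft-threshold(b_j, lam * alpha_j / 2) with b = y / sqrt(n).
n = 4
X = np.sqrt(n) * np.eye(n)
y = np.sqrt(n) * np.array([3.0, -1.0, 0.2, 0.0])
alpha = np.array([1.0, 1.0, 1.0, 5.0])
theta_hat = weighted_lasso_ista(X, y, lam=1.0, alpha=alpha)
```

Larger weights $\alpha_j$ push the corresponding coordinates to zero more aggressively, which is how the reweighted estimator can exclude coordinates known to be uninformative.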
Figure 1: Plots with regularization weight $\lambda = 1$, and parameters $a = 3.7$ for SCAD, and $b = 2.7$
for MCP. (Horizontal axis: value of $t$.)


Square-root Lasso: The square-root Lasso estimator [2] is defined by minimizing the criterion

$$\hat\theta_{\mathrm{sqrt}} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{\sqrt{n}}\|y - X\theta\|_2 + \lambda\|\theta\|_1 \Big\}.$$

This criterion is slightly different from our general objective function (9), since it involves the square
root of the least-squares error. Relative to the Lasso, its primary advantage is that the optimal
setting of the regularization parameter does not require the knowledge of the standard deviation of
the noise. For the purposes of the current analysis, it suffices to note that by Lagrangian duality,
every square-root Lasso estimate $\hat\theta_{\mathrm{sqrt}}$ is a minimizer of the least-squares criterion $\|y - X\theta\|_2^2$, subject
to $\|\theta\|_1 \le R$, for some radius $R \ge 0$ depending on $\lambda$. Consequently, as the weight $\lambda$ is varied over
the interval $[0, \infty)$, the square-root Lasso yields the same solution path as the Lasso. Since our
lower bounds apply to the Lasso for any choice of $\lambda \ge 0$, they also apply to all square-root Lasso
solutions.


SCAD penalty or MCP regularizer: Due to the intrinsic bias induced by $\ell_1$-regularization,
various forms of nonconvex regularization are widely used. Two of the most popular are the SCAD
penalty, due to Fan and Li [14], and the MCP penalty, due to Zhang et al. [34]. The family of
SCAD penalties takes the form

$$\phi_\lambda(t) := \frac{1}{\lambda} \begin{cases} \lambda|t| & \text{for } |t| \le \lambda, \\ -(t^2 - 2a\lambda|t| + \lambda^2)/(2a - 2) & \text{for } \lambda < |t| \le a\lambda, \\ (a + 1)\lambda^2/2 & \text{for } |t| > a\lambda, \end{cases}$$
where $a > 2$ is a fixed parameter. When used with the least-squares objective, it is a special case
of our general set-up with $\rho_j(\theta_j) = \phi_\lambda(\theta_j)$ for each coordinate $j = 1, \ldots, d$. Similarly, the MCP
penalty takes the form

$$\varphi_\lambda(t) := \int_0^{|t|} \Big(1 - \frac{z}{\lambda b}\Big)_+ \, dz,$$

where $b > 0$ is a fixed parameter. It can be verified that both the SCAD penalty and the MCP
regularizer belong to the function class $\mathcal{F}$ previously defined. See Figure 1 for a graphical illustration
of the SCAD penalty and the MCP regularizer.
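Membership in the class of coordinate-separable regularizers can be checked numerically on a grid. The sketch below is an illustration under stated assumptions: SCAD is written in its standard three-piece form divided by $\lambda$, MCP is taken in its standard integral form $\int_0^{|t|}(1 - z/(\lambda b))_+\,dz$ evaluated in closed form, and the function names `scad` and `mcp` are introduced here.

```python
import numpy as np

def scad(t, lam=1.0, a=3.7):
    """SCAD penalty in its standard three-piece form, divided by lam."""
    t = np.abs(np.asarray(t, dtype=float))
    small = lam * t
    middle = -(t**2 - 2 * a * lam * t + lam**2) / (2 * a - 2)
    large = (a + 1) * lam**2 / 2
    return np.where(t <= lam, small, np.where(t <= a * lam, middle, large)) / lam

def mcp(t, lam=1.0, b=2.7):
    """MCP penalty: integral over [0, |t|] of (1 - z / (lam * b))_+ dz, in closed form."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam * b, t - t**2 / (2 * lam * b), lam * b / 2)

grid = np.linspace(0.0, 6.0, 601)
for penalty in (scad, mcp):
    values = penalty(grid)
    assert float(penalty(0.0)) == 0.0               # property (ii): value zero at the origin
    assert np.allclose(penalty(-grid), values)      # property (ii): symmetry around zero
    assert np.all(np.diff(values) >= -1e-12)        # property (iii): nondecreasing on [0, inf)
```

Both penalties grow like $|t|$ near the origin and then flatten to a constant, which is the source of their reduced bias relative to the $\ell_1$-penalty.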


2.2 Prediction error for the Lasso 


We now turn to a precise statement of the best known upper bounds for the Lasso prediction error.
We assume that the design matrix satisfies the column normalization condition. More precisely,
letting $X_j \in \mathbb{R}^n$ denote the $j$th column of the design matrix $X$, we say that it is 1-column normalized
if

$$\frac{\|X_j\|_2}{\sqrt{n}} \le 1 \quad \text{for } j = 1, 2, \ldots, d. \tag{10}$$

Our choice of the constant 1 is to simplify notation; the more general notion allows for an arbitrary
constant $C$ in this bound.

In addition to the column normalization condition, if the design matrix further satisfies a
restricted eigenvalue (RE) condition [4, 31], then the Lasso is known to achieve the fast rate (4)
for prediction error. More precisely, restricted eigenvalues are defined in terms of subsets $S$ of the
index set $\{1, 2, \ldots, d\}$, and a cone associated with any such subset. In particular, letting $S^c$ denote
the complement of $S$, we define the cone

$$\mathcal{C}(S) := \big\{ \theta \in \mathbb{R}^d \mid \|\theta_{S^c}\|_1 \le 3\|\theta_S\|_1 \big\}.$$




Here $\|\theta_{S^c}\|_1 := \sum_{j \in S^c} |\theta_j|$ corresponds to the $\ell_1$-norm of the coefficients indexed by $S^c$, with $\|\theta_S\|_1$
defined similarly. Note that any vector $\theta^*$ supported on $S$ belongs to the cone $\mathcal{C}(S)$; in addition,
it includes vectors whose $\ell_1$-norm on the “bad” set $S^c$ is small relative to their $\ell_1$-norm on $S$.
Given a triplet $(n, d, k)$, the matrix $X \in \mathbb{R}^{n \times d}$ is said to satisfy a $\gamma$-RE condition (also known as a
compatibility condition) if

$$\frac{1}{n}\|X\theta\|_2^2 \ge \gamma\|\theta\|_2^2 \quad \text{for all } \theta \in \bigcup_{|S| = k} \mathcal{C}(S). \tag{11}$$
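The conditions above can be probed numerically for a given design and a given vector. The following is a small sketch with helper names (`is_column_normalized`, `in_cone`, `re_ratio`) introduced here for illustration; it checks a single vector and does not certify the RE condition, which quantifies over the whole union of cones.

```python
import numpy as np

def is_column_normalized(X, C=1.0):
    """Check ||X_j||_2 / sqrt(n) <= C for every column j."""
    n = X.shape[0]
    return bool(np.all(np.linalg.norm(X, axis=0) / np.sqrt(n) <= C + 1e-12))

def in_cone(theta, S):
    """Check ||theta_{S^c}||_1 <= 3 ||theta_S||_1 for an index set S."""
    mask = np.zeros(theta.shape[0], dtype=bool)
    mask[list(S)] = True
    return bool(np.abs(theta[~mask]).sum() <= 3.0 * np.abs(theta[mask]).sum())

def re_ratio(X, theta):
    """Return (1/n)||X theta||_2^2 / ||theta||_2^2; gamma-RE bounds this below by gamma on the cone."""
    n = X.shape[0]
    return float(np.sum((X @ theta) ** 2) / (n * np.sum(theta ** 2)))

# Two identical columns: column-normalized, yet the ratio vanishes on a cone vector.
X = np.ones((2, 2))
theta = np.array([1.0, -1.0])
normalized = is_column_normalized(X)
member = in_cone(theta, S={0})
ratio = re_ratio(X, theta)
```

In this toy design the ratio is zero for a vector in the cone, so no $\gamma > 0$ can satisfy the RE condition, even though the columns are normalized.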


The following result [4, 25, 6] provides a bound on the prediction error for the Lasso estimator:

Proposition 1 (Prediction error for Lasso with RE condition). Consider the standard linear model
for a design matrix $X$ satisfying the column normalization condition (10) and the $\gamma$-RE condition.
Then for any vector $\theta^* \in \mathbb{B}_0(k)$, the Lasso estimator $\hat\theta_{\lambda_n}$ with $\lambda_n = 4\sigma\sqrt{\frac{\log d}{n}}$ satisfies

$$\frac{1}{n}\|X\hat\theta_{\lambda_n} - X\theta^*\|_2^2 \le c\,\frac{\sigma^2 k \log d}{\gamma^2 n} \quad \text{for any } \theta^* \in \mathbb{B}_0(k) \tag{12}$$

with probability at least $1 - c_1 e^{-c_2 n \lambda_n^2}$.





The Lasso rate (12) will match the optimal rate (4) if the RE constant $\gamma$ is bounded away from
zero. If $\gamma$ is close to zero, then the Lasso rate could be arbitrarily worse than the optimal rate.
It is known that the RE condition is necessary for recovering the true vector $\theta^*$ (see, e.g., [28]), but
minimizing the prediction error should be easier than recovering the true vector. In particular,
strong correlations between the columns of $X$, which lead to violations of the RE conditions,
should have no effect on the intrinsic difficulty of the prediction problem. Recall that the $\ell_0$-based
estimator $\hat\theta_{\ell_0}$ satisfies the prediction error upper bound (4) without any constraint on the design
matrix. Moreover, Raskutti et al. [28] show that many problems with strongly correlated columns
are actually easy from the prediction point of view.

In the absence of RE conditions, $\ell_1$-based methods are known to achieve the slow $1/\sqrt{n}$ rate,
with the only constraint on the design matrix being a uniform column bound [4]:


Proposition 2 (Prediction error for Lasso without RE condition). Consider the standard linear
model for a design matrix $X$ satisfying the column normalization condition (10). Then for any
vector $\theta^* \in \mathbb{B}_0(k) \cap \mathbb{B}_1(R)$, the Lasso estimator $\hat\theta_{\lambda_n}$, with $\lambda_n = 4\sigma\sqrt{\frac{2\log d}{n}}$, satisfies the bound

$$\frac{1}{n}\|X(\hat\theta_{\lambda_n} - \theta^*)\|_2^2 \le c\,\sigma R \Big( \sqrt{\frac{2\log d}{n}} + \delta \Big) \tag{13}$$

with probability at least $1 - c_1 d\, e^{-c_2 n \delta^2}$.

Combining the bounds of Proposition 1 and Proposition 2, we have

$$\mathcal{M}_{n,k,d}(\hat\theta_{\ell_1}; X) \le c' \min\Big\{ \frac{\sigma^2 k \log d}{\gamma^2 n},\; \sigma R \sqrt{\frac{\log d}{n}} \Big\}. \tag{14}$$

If the RE constant $\gamma$ is sufficiently close to zero, then the second term on the right-hand side will
dominate the first term. In that case, the $1/\sqrt{n}$ achievable rate is substantially slower than the
$1/n$ optimal rate for reasonable ranges of $(k, R)$. One might wonder whether the analysis leading
to the bound (14) could be sharpened so as to obtain the fast rate. Among other consequences,
our first main result (Theorem 1 below) shows that no substantial sharpening is possible.
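To see which term of the combined bound is active, the two expressions can be compared numerically. The sketch below ignores the unspecified universal constant, and the particular values of $(n, d, k, \sigma, R, \gamma)$ as well as the function name `lasso_upper_bound_terms` are arbitrary choices made here for illustration.

```python
import numpy as np

def lasso_upper_bound_terms(n, d, k, sigma, R, gamma):
    """Return the two terms inside the minimum of (14), up to constants."""
    fast = sigma**2 * k * np.log(d) / (gamma**2 * n)
    slow = sigma * R * np.sqrt(np.log(d) / n)
    return fast, slow

# Small RE constant: the scaled fast-rate term is large, so the slow-rate term is the minimum.
fast_bad, slow_bad = lasso_upper_bound_terms(n=10_000, d=20_000, k=10, sigma=1.0, R=10.0, gamma=0.01)

# RE constant bounded away from zero: the fast-rate term is the minimum.
fast_good, slow_good = lasso_upper_bound_terms(n=10_000, d=20_000, k=10, sigma=1.0, R=10.0, gamma=1.0)
```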


3 Main results 

We now turn to statements of our main results, and discussion of their consequences. 

3.1 A general lower bound 

Our analysis applies to the set of local minima of the objective function $L$ defined in equation (9).
More precisely, a vector $\tilde\theta$ is a local minimum of the function $\theta \mapsto L(\theta; \lambda)$ if there is an open ball $\mathbb{B}$
centered at $\tilde\theta$ such that $\tilde\theta \in \arg\min_{\theta \in \mathbb{B}} L(\theta; \lambda)$. We then define the set

$$\widehat{\Theta}_\lambda := \big\{ \theta \in \mathbb{R}^d \mid \theta \text{ is a local minimum of the function } \theta \mapsto L(\theta; \lambda) \big\}, \tag{15}$$

an object that depends on the triplet $(X, y, \rho)$ as well as the choice of regularization weight $\lambda$. Since
the function $\rho$ might be non-convex, the set $\widehat{\Theta}_\lambda$ may contain multiple elements.

At best, a typical descent method applied to the objective $L$ can be guaranteed to converge to
some element of $\widehat{\Theta}_\lambda$. The following theorem provides a lower bound, applicable to any method that
always returns some local minimum of the objective function (9).
Theorem 1. For any pair $(n, d)$ such that $d \ge n \ge 4$, any sparsity level $k \ge 2$ and any radius
$R \ge 8\sigma/\sqrt{n}$, there is a design matrix $X \in \mathbb{R}^{n \times d}$ satisfying the column normalization condition (10)
such that for any coordinate-separable penalty, we have

$$\sup_{\theta^* \in \mathbb{B}_0(k) \cap \mathbb{B}_1(R)} \mathbb{E}\Big[ \inf_{\lambda \ge 0} \sup_{\theta \in \widehat{\Theta}_\lambda} \frac{1}{n}\|X(\theta - \theta^*)\|_2^2 \Big] \ge c \min\Big\{ \sigma^2, \frac{\sigma R}{\sqrt{n}} \Big\}. \tag{16a}$$

Moreover, for any convex coordinate-separable penalty, we have

$$\sup_{\theta^* \in \mathbb{B}_0(k) \cap \mathbb{B}_1(R)} \mathbb{E}\Big[ \inf_{\lambda \ge 0} \inf_{\theta \in \widehat{\Theta}_\lambda} \frac{1}{n}\|X(\theta - \theta^*)\|_2^2 \Big] \ge c \min\Big\{ \sigma^2, \frac{\sigma R}{\sqrt{n}} \Big\}. \tag{16b}$$

In both of these statements, the constant $c$ is universal, independent of $(n, d, k, \sigma, R)$ as well as the
design matrix. See Section 4.1 for the proof.

In order to interpret the lower bound (16a), consider any estimator $\hat\theta_\lambda$ that takes values in the set
$\widehat{\Theta}_\lambda$, corresponding to local minima of $L$. The result is of a game-theoretic flavor: the statistician is
allowed to adaptively choose $\lambda$ based on the observations $(y, X)$, whereas nature is allowed to act
adversarially in choosing a local minimum for every execution of $\hat\theta_\lambda$. Under this setting, Theorem 1
implies that

$$\sup_{\theta^* \in \mathbb{B}_0(k) \cap \mathbb{B}_1(R)} \frac{1}{n}\mathbb{E}\Big[ \|X\hat\theta_\lambda - X\theta^*\|_2^2 \Big] \ge c \min\Big\{ \sigma^2, \frac{\sigma R}{\sqrt{n}} \Big\}. \tag{17}$$

For any convex regularizer (such as the $\ell_1$-penalty underlying the Lasso estimate), equation (16b)
provides a stronger lower bound, one that holds uniformly over all choices of $\lambda \ge 0$ and all (global)
minima. For the Lasso estimator, the lower bound of Theorem 1 matches the upper bound (13) up
to the logarithmic term $\sqrt{\log d}$, showing that the lower bound is almost tight.


It is possible that lower bounds of this form hold only for extremely ill-conditioned design
matrices, which would render the consequences of the result less broadly applicable. In particular,
it is natural to wonder whether it is also possible to prove a non-trivial lower bound when the
restricted eigenvalues are bounded above zero. Recall that under the RE condition with a positive
constant $\gamma$, the Lasso will achieve a mixture rate (14), consisting of a scaled fast rate $1/(\gamma^2 n)$ and
the slow rate $1/\sqrt{n}$. The following result shows that this mixture rate cannot be improved to match
the fast rate.


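Corollary 1. For any sparsity level $k \ge 1$, any integers $d = n \ge 4k$, any radius $R \ge 8\sigma/\sqrt{n}$ and
any constant $\gamma \in (0, 1]$, there is a design matrix $X \in \mathbb{R}^{n \times d}$ satisfying the column normalization
condition (10) and the $\gamma$-RE condition, such that for any coordinate-separable penalty, we have

$$\sup_{\theta^* \in \mathbb{B}_0(2k) \cap \mathbb{B}_1(R)} \mathbb{E}\Big[ \inf_{\lambda \ge 0} \sup_{\theta \in \widehat{\Theta}_\lambda} \frac{1}{n}\|X(\theta - \theta^*)\|_2^2 \Big] \ge c \min\Big\{ \sigma^2, \frac{k\sigma^2}{\gamma n}, \frac{\sigma R}{\sqrt{n}} \Big\}. \tag{18a}$$

Moreover, for any convex coordinate-separable penalty, we have

$$\sup_{\theta^* \in \mathbb{B}_0(2k) \cap \mathbb{B}_1(R)} \mathbb{E}\Big[ \inf_{\lambda \ge 0} \inf_{\theta \in \widehat{\Theta}_\lambda} \frac{1}{n}\|X(\theta - \theta^*)\|_2^2 \Big] \ge c \min\Big\{ \sigma^2, \frac{k\sigma^2}{\gamma n}, \frac{\sigma R}{\sqrt{n}} \Big\}. \tag{18b}$$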
Since none of the three terms on the right-hand side of inequalities (18a) and (18b) matches the
optimal rate (4), the corollary implies that the optimal rate is not achievable even if the restricted
eigenvalues are bounded above zero. Comparing this lower bound to the upper bound (14), there
are two factors that are not perfectly matched. First, the upper bound depends on $\log d$, but there
is no such dependence in the lower bound. Second, the upper bound has a term that is proportional
to $1/\gamma^2$, but the corresponding term in the lower bound is proportional to $1/\gamma$. Proving a sharper
lower bound that closes this gap remains an open problem.

We remark that Corollary 1 follows by a refinement of the proof of Theorem 1. In particular, we
first show that the design matrix underlying Theorem 1 (call it $X_{\mathrm{bad}}$) satisfies the $\gamma_n$-RE condition,
where the quantity $\gamma_n$ converges to zero as a function of sample size $n$. In order to prove Corollary 1,
we construct a new block-diagonal design matrix such that each block corresponds to a version of
$X_{\mathrm{bad}}$. The sizes of these blocks are then chosen so that, given a predefined quantity $\gamma > 0$, the new
matrix satisfies the $\gamma$-RE condition. We then lower bound the prediction error of this new matrix,
using Theorem 1 to lower bound the prediction error of each of the blocks. We refer the reader to
Section 4.2 for the full proof.

3.2 Lower bounds for local descent methods 

For any least-squares cost with a coordinate-wise separable regularizer, Theorem 1 establishes the
existence of at least one “bad” local minimum such that the associated prediction error is lower
bounded by $1/\sqrt{n}$. One might argue that this result could be overly pessimistic, in that the
adversary is given too much power in choosing local minima. Indeed, the mere existence of bad 
local minima need not be a practical concern unless it can be shown that a typical optimization 
algorithm will frequently converge to one of them. 

Steepest descent is a standard first-order algorithm for minimizing a convex cost function [3, 5]. 
However, for non-convex and non-differentiable loss functions, it is known that the steepest descent 
method does not necessarily yield convergence to a local minimum [13, 33]. Although there exist 
provably convergent first-order methods for general non-convex optimization (e.g., [23, 20]), the 
paths defined by their iterations are difficult to characterize, and it is also difficult to predict the 
point to which the algorithm eventually converges. 

In order to address a broad class of methods in a unified manner, we begin by observing that 
most first-order methods can be seen as iteratively and approximately solving a local minimization 
problem. For example, given a stepsize parameter 7 > 0, the method of steepest descent iteratively 
approximates the minimizer of the objective over a ball of radius 7 . Similarly, the convergence of 
algorithms for non-convex optimization algorithms is based on the fact that they guarantee decrease 
of the function value in the local neighborhood of the current iterate [23, 20]. We thus study an 
iterative local descent algorithm taking the form: 

$$\theta^{t+1} \in \arg\min_{\theta \in \mathbb{B}_2(\eta; \theta^t)} L(\theta; \lambda), \qquad (19)$$

where $\eta > 0$ is a given parameter, and $\mathbb{B}_2(\eta; \theta^t) := \{\theta \in \mathbb{R}^d \mid \|\theta - \theta^t\|_2 \le \eta\}$ is the ball of radius
$\eta$ around the current iterate. If there are multiple points achieving the optimum, the algorithm
chooses the one that is closest to $\theta^t$, resolving any remaining ties by randomization. The algorithm
terminates when there is a minimizer belonging to the interior of the ball $\mathbb{B}_2(\eta; \theta^t)$, that is, exactly
when $\theta^{t+1}$ is a local minimum of the loss function.
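
To make the update (19) concrete, here is a minimal one-dimensional sketch in Python. It is an illustration under stated assumptions rather than the procedure analyzed here: the exact minimization over the ball is replaced by a grid search over the interval $[\theta^t - \eta, \theta^t + \eta]$, remaining ties are broken deterministically instead of by randomization, and the names `local_descent_1d`, `two_wells`, `grid_size` and `max_iter` are hypothetical.

```python
import numpy as np


def local_descent_1d(loss, theta0, eta, grid_size=601, max_iter=10_000):
    """Grid-search approximation of update (19) for a scalar parameter."""
    theta = float(theta0)
    for _ in range(max_iter):
        # Candidate points in the "ball" (an interval in one dimension).
        grid = np.linspace(theta - eta, theta + eta, grid_size)
        values = np.array([loss(x) for x in grid])
        # Among (numerically) optimal candidates, keep the one closest to theta.
        ties = np.flatnonzero(values <= values.min() + 1e-12)
        idx = ties[np.argmin(np.abs(grid[ties] - theta))]
        theta = float(grid[idx])
        # Stop once the minimizer lies strictly inside the interval.
        if 0 < idx < grid_size - 1:
            return theta
    raise RuntimeError("no interior minimizer found within max_iter steps")


def two_wells(x):
    """Toy non-convex loss: a shallow well at x = -1 and a deeper one at x = 2."""
    return min((x + 1.0) ** 2, (x - 2.0) ** 2 - 1.0)


convex_limit = local_descent_1d(lambda x: (x - 1.0) ** 2, theta0=0.0, eta=0.3)
trapped_limit = local_descent_1d(two_wells, theta0=-1.2, eta=0.3)
```

In this toy run, the convex loss is driven to its minimizer near 1 in a few steps, whereas the two-well loss started at -1.2 stops near the shallow well at -1 even though the well at 2 has a lower value. This is the qualitative behavior that the lower bound for local descent formalizes.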

It should be noted that the update (19) defines a powerful algorithm (one that might not
be easy to implement in polynomial time), since it is guaranteed to return the global minimum
of a nonconvex program over the ball $\mathbb{B}_2(\eta; \theta^t)$. In a certain sense, it is more powerful than any
first-order optimization method, since it will always decrease the function value at least as much
as a descent step with stepsize related to $\eta$. Since we are proving lower bounds, these observations
only strengthen our result. We impose two additional conditions on the regularizers:


(iv) Each component function $\rho_j$ is continuous at the origin.

(v) There is a constant $H$ such that $|\rho'_j(x) - \rho'_j(\tilde{x})| \le H |x - \tilde{x}|$ for any pair $x, \tilde{x} \in (0, \infty)$.

Assumptions (i)-(v) are more restrictive than assumptions (i)-(iii), but they are satisfied by many
popular penalties. As illustrative examples, for the $\ell_1$-norm, we have $H = 0$. For the SCAD
penalty, we have $H = 1/(a - 1)$, whereas for the MCP regularizer, we have $H = 1/b$. Finally, in
order to prevent the update (19) being so powerful that it reaches the global minimum in one single
step, we impose an additional condition on the stepsize, namely that


$$\eta \le \min\Big\{B, \frac{B}{\lambda H}\Big\}, \quad \text{where } B := \frac{\sigma}{4\sqrt{n}}. \qquad (20)$$
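
The constants $H$ quoted for the $\ell_1$, SCAD and MCP penalties can be checked numerically. The sketch below assumes the standard parametrizations of the SCAD and MCP derivatives on $(0, \infty)$ (with a scale parameter `lam` that does not affect the Lipschitz constant); the helper names are hypothetical and the grid-based estimate is only an approximation of the true Lipschitz constant.

```python
import numpy as np


def scad_derivative(x, lam=1.0, a=3.7):
    """Derivative of the SCAD penalty for x > 0 (standard parametrization)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= lam, lam, np.maximum(a * lam - x, 0.0) / (a - 1.0))


def mcp_derivative(x, lam=1.0, b=2.7):
    """Derivative of the MCP penalty for x > 0 (standard parametrization)."""
    x = np.asarray(x, dtype=float)
    return np.maximum(lam - x / b, 0.0)


def empirical_lipschitz(derivative, xs):
    """Largest slope of `derivative` between neighbouring grid points."""
    return float(np.max(np.abs(np.diff(derivative(xs)) / np.diff(xs))))


xs = np.linspace(1e-6, 10.0, 200_001)
h_l1 = empirical_lipschitz(lambda x: np.ones_like(x), xs)  # l1: constant derivative
h_scad = empirical_lipschitz(scad_derivative, xs)          # close to 1 / (a - 1)
h_mcp = empirical_lipschitz(mcp_derivative, xs)            # close to 1 / b
```

With $a = 3.7$ and $b = 2.7$ the two non-convex penalties have the same constant $1/2.7$, since $a - 1 = b$.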


It is reasonable to assume that the stepsize is bounded by a time-invariant constant, as we can always
partition a single-step update into a finite number of smaller steps, increasing the algorithm’s time
complexity by a multiplicative constant. On the other hand, the $O(1/\sqrt{n})$ stepsize is adopted
by popular first-order methods. Under these assumptions, we have the following theorem, which
applies to any regularizer $\rho$ that satisfies Assumptions (i)-(v).


Theorem 2. For any pair $(n, d)$ such that $d \ge n \ge 4$, integer $k \ge 2$ and any scalars $\gamma > 0$ and
$R \ge \sigma/\sqrt{n}$, there is a design matrix $X \in \mathbb{R}^{n \times d}$ satisfying the column normalization condition (10)
such that

(a) The update (19) terminates after a finite number of steps $T$ at a vector $\hat{\theta} = \theta^{T+1}$ that is a local
minimum of the loss function.

(b) Given a random initialization $\theta^0 \sim N(0, \gamma^2 I_{d \times d})$, the local minimum satisfies the lower bound

$$\sup_{\theta^* \in \mathbb{B}_0(k) \cap \mathbb{B}_1(R)} \mathbb{E}\Big[\inf_{\lambda \ge 0} \frac{1}{n}\|X\hat{\theta} - X\theta^*\|_2^2\Big] \ge c \min\{R, \sigma\} \frac{\sigma}{\sqrt{n}}.$$


Theorem 2 shows that local descent methods based on a random initialization do not lead to 
local optima that achieve the fast rate. This conclusion provides stronger negative evidence than 
Theorem 1, since it shows that bad local minima not only exist, but are difficult to avoid. 


3.3 Simulations 

In the proof of Theorem 1 and Theorem 2, we construct specific design matrices to make the 
problem hard to solve. In this section, we apply several popular algorithms to the solution of 
the sparse linear regression problem on these “hard” examples, and compare their performance 
with the $\ell_0$-based estimator (3). More specifically, focusing on the special case $n = d$, we perform
simulations for the design matrix $X \in \mathbb{R}^{n \times n}$ used in the proof of Theorem 2. It is given by


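$$X = \mathrm{blkdiag}\big\{\underbrace{\sqrt{n}A, \sqrt{n}A, \ldots, \sqrt{n}A}_{n/2 \text{ copies}}\big\},$$

where the sub-matrix $A$ takes the form

$$A = \begin{bmatrix} \cos(\alpha) & -\cos(\alpha) \\ \sin(\alpha) & \sin(\alpha) \end{bmatrix}, \quad \text{where } \alpha = \arcsin(n^{-1/4}).$$

Given the 2-sparse regression vector $\theta^* = (0.5, 0.5, 0, \ldots, 0)$, we form the response vector $y = X\theta^* + w$,
where $w \sim N(0, I_{n \times n})$.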
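
The construction of this design matrix is easy to reproduce. The following NumPy sketch builds the block-diagonal matrix for an even $n$, checks that every column has Euclidean norm $\sqrt{n}$ (our reading of the column normalization condition, which is an assumption here), and evaluates the prediction error of the all-zeros estimate. The function name `hard_design` and the choices $n = 100$ and the random seed are illustrative only.

```python
import numpy as np


def hard_design(n):
    """Block-diagonal design with n/2 copies of sqrt(n) * A, alpha = arcsin(n^(-1/4))."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    alpha = np.arcsin(n ** -0.25)
    A = np.array([[np.cos(alpha), -np.cos(alpha)],
                  [np.sin(alpha), np.sin(alpha)]])
    # The Kronecker product with the identity places sqrt(n) * A on the diagonal.
    return np.kron(np.eye(n // 2), np.sqrt(n) * A)


n = 100
X = hard_design(n)
theta_star = np.zeros(n)
theta_star[:2] = 0.5

rng = np.random.default_rng(0)
y = X @ theta_star + rng.standard_normal(n)  # response with unit-variance noise

col_norms = np.linalg.norm(X, axis=0) / np.sqrt(n)
err_zero = np.sum((X @ (np.zeros(n) - theta_star)) ** 2) / n
```

For this $\theta^*$ the all-zeros estimate has prediction error $\sin^2(\alpha) = n^{-1/2}$, which equals 0.1 when $n = 100$.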

We compare the $\ell_0$-based estimator, referred to as the baseline estimator, with three other
methods: the Lasso estimator [30], the estimator based on the SCAD penalty [14] and the estimator
based on the MCP penalty [34]. In implementing the $\ell_0$-based estimator, we provide it with the
knowledge that $k = 2$, since the true vector $\theta^*$ is 2-sparse. For Lasso, we adopt the MATLAB
implementation [1], which generates a Lasso solution path evaluated at 100 different regularization
parameters, and we choose the estimate that yields the smallest prediction error. For the SCAD
penalty, we choose $a = 3.7$ as suggested by Fan and Li [14]. For the MCP penalty, we choose
$b = 2.7$, so that the maximum concavity of the MCP penalty matches that of the SCAD penalty.
For the SCAD penalty and the MCP penalty (and recalling that $d = n$), we studied choices of
the regularization weight of the form $\lambda = C\sqrt{\log n / n}$ for a pre-factor $C$ to be determined. As shown
in past work on non-convex regularizers [22], such choices of $\lambda$ lead to low $\ell_2$-error. By manually
tuning the parameter $C$ to optimize the prediction error, we found that $C = 0.1$ is a reasonable
choice. We used routines from the GIST package [18] to optimize these non-convex objectives.

By varying the sample size over the range 10 to 1000, we obtained the results plotted in Figure 2,
in which the prediction error $\mathbb{E}[\frac{1}{n}\|X(\hat{\theta} - \theta^*)\|_2^2]$ and sample size $n$ are both plotted on a logarithmic
scale. The performance of the Lasso, SCAD-based estimate, and MCP-based estimate are all
similar. For all of the three methods, the prediction error scales as $1/\sqrt{n}$, as confirmed by the
slopes of the corresponding lines in Figure 2, which are very close to 0.5. In fact, by examining
the estimator’s output, we find that in many cases, all three estimators output $\hat{\theta} = 0$, leading to
the prediction error $\frac{1}{n}\|X(0 - \theta^*)\|_2^2 = \frac{1}{\sqrt{n}}$. Since the regularization parameters have been chosen
to optimize the prediction error, this scaling is the best rate that the three estimators are able to
achieve, and it matches the theoretical prediction of Theorem 1 and Theorem 2.

In contrast, the $\ell_0$-based estimator achieves a substantially better error rate. The slope of the
corresponding line in Figure 2 is very close to 1, which means that the prediction error of the $\ell_0$-based
estimator scales as $1/n$, thereby matching the theoretically-predicted scaling (4).
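
Reading a rate off a log-log plot amounts to fitting a line to $(\log n, \log \text{error})$. The sketch below does this on synthetic curves that follow $n^{-1/2}$ and $n^{-1}$ exactly; these are not the simulation results reported here, and the helper name `loglog_slope` is hypothetical. On a decreasing error curve the fitted slope is negative, so the values 0.5 and 1 quoted in the text correspond to slope magnitudes.

```python
import numpy as np


def loglog_slope(ns, errors):
    """Least-squares slope of log(error) against log(n)."""
    slope, _intercept = np.polyfit(np.log(ns), np.log(errors), 1)
    return float(slope)


ns = np.array([10.0, 30.0, 100.0, 300.0, 1000.0])
slow_slope = loglog_slope(ns, 1.0 / np.sqrt(ns))  # synthetic slow-rate curve
fast_slope = loglog_slope(ns, 1.0 / ns)           # synthetic fast-rate curve
```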


4 Proofs 

We now turn to the proofs of our theorems and corollary. In each case, we defer the proofs of more 
technical results to the appendices. 

Figure 2: Problem scale $n$ versus the prediction error $\mathbb{E}[\frac{1}{n}\|X(\hat{\theta} - \theta^*)\|_2^2]$. The expectation is
computed by averaging 100 independent runs of the algorithm. Both the sample size $n$ and the
prediction error are plotted on a logarithmic scale.


4.1 Proof of Theorem 1 


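For a given triple $(n, \sigma, R)$, we define the angle $\alpha := \arcsin\Big(\frac{\sqrt{\sigma}}{n^{1/4}\sqrt{32R}}\Big)$
and the two-by-two matrix

$$A = \begin{bmatrix} \cos(\alpha) & -\cos(\alpha) \\ \sin(\alpha) & \sin(\alpha) \end{bmatrix}. \qquad (21a)$$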


Using the matrix $A \in \mathbb{R}^{2 \times 2}$ as a building block, we construct a design matrix $X \in \mathbb{R}^{n \times d}$. Without
loss of generality, we may assume that $n$ is divisible by two. (If $n$ is not divisible by two, constructing
a $(n - 1)$-by-$d$ design matrix concatenated by a row of zeros only changes the result by a constant.)
We then define

$$X = \Big[\,\mathrm{blkdiag}\big\{\underbrace{\sqrt{n}A, \sqrt{n}A, \ldots, \sqrt{n}A}_{n/2 \text{ copies}}\big\} \quad 0\,\Big] \in \mathbb{R}^{n \times d}, \qquad (21b)$$

where the all-zeroes matrix on the right side has dimensions $n \times (d - n)$. It is easy to verify that
the matrix $X$ defined in this way satisfies the column normalization condition (10).

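Next, we prove the lower bound (16a). For any integers $i, j \in [d]$ with $i < j$, let $\theta_i$ denote the
$i$-th coordinate of $\theta$, and let $\theta_{i:j}$ denote the subvector with entries $\{\theta_i, \ldots, \theta_j\}$. Since the matrix $A$
appears in diagonal blocks of $X$, we have

$$\inf_{\lambda \ge 0} \sup_{\theta \in \hat{\Theta}_\lambda} \frac{1}{n}\|X(\theta - \theta^*)\|_2^2 = \inf_{\lambda \ge 0} \sup_{\theta \in \hat{\Theta}_\lambda} \sum_{i=1}^{n/2} \|A(\theta_{(2i-1):2i} - \theta^*_{(2i-1):2i})\|_2^2, \qquad (22)$$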

and it suffices to lower bound the right-hand side of the above equation.

For the sake of simplicity, we introduce the shorthand $B := \frac{4\sigma}{\sqrt{n}}$, and define the scalars

$$\gamma_i = \min\{\rho_{2i-1}(B), \rho_{2i}(B)\} \quad \text{for each } i = 1, \ldots, n/2. \qquad (23)$$

Furthermore, we define

$$a_i := \begin{cases} (\cos\alpha, \sin\alpha) & \text{if } \gamma_i = \rho_{2i-1}(B) \\ (-\cos\alpha, \sin\alpha) & \text{if } \gamma_i = \rho_{2i}(B) \end{cases} \quad \text{and} \quad w'_i := \frac{\langle a_i, w_{(2i-1):2i} \rangle}{\sqrt{n}}. \qquad (24)$$

Without loss of generality, we may assume that $\gamma_1 = \max_{i \in [n/2]} \{\gamma_i\}$ and $\gamma_i = \rho_{2i-1}(B)$ for all $i \in [n/2]$.
If this condition does not hold, we can simply re-index the columns of $X$ to make these properties
hold. Note that when we swap the columns $2i - 1$ and $2i$, the value of $a_i$ doesn’t change; it is
always associated with the column whose regularization term is equal to $\gamma_i$.

Finally, we define the regression vector $\theta^* = [\tfrac{R}{2} \;\; \tfrac{R}{2} \;\; 0 \;\cdots\; 0] \in \mathbb{R}^d$. Given these definitions,
the following lemma lower bounds each term on the right-hand side of equation (22).

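Lemma 1. For any $\lambda \ge 0$, there is a local minimum $\hat{\theta}_\lambda$ of the objective function $L(\theta; \lambda)$ such that
$\frac{1}{n}\|X(\hat{\theta}_\lambda - \theta^*)\|_2^2 \ge T_1 + T_2$, where

$$T_1 := \mathbb{I}\Big[\lambda\gamma_1 > 4B\Big(\sin^2(\alpha)R + \frac{\|w_{1:2}\|_2}{\sqrt{n}}\Big)\Big] \sin^2(\alpha)(R - 2B)_+^2, \quad \text{and} \qquad (25a)$$

$$T_2 := \sum_{i=2}^{n/2} \mathbb{I}\big[B/2 \le w'_i \le B\big] \Big(\frac{B^2}{4} - \lambda\gamma_1\Big)_+. \qquad (25b)$$

Moreover, if the regularizer $\rho$ is convex, then every minimizer $\hat{\theta}_\lambda$ satisfies this lower bound.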

See Appendix B for the proof of this claim. 

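Using Lemma 1, we can now complete the proof of the theorem. It is convenient to condition
on the event $\mathcal{E} := \{\|w_{1:2}\|_2 \le \frac{\sigma}{32}\}$. Since $\|w_{1:2}\|_2^2/\sigma^2$ follows a chi-square distribution with two
degrees of freedom, we have $\mathbb{P}[\mathcal{E}] > 0$. Conditioned on this event, we now consider two separate
cases:

Case 1: First, suppose that $\lambda\gamma_1 > \sigma^2/n$. In this case, we have

$$4B\Big(\sin^2(\alpha)R + \frac{\|w_{1:2}\|_2}{\sqrt{n}}\Big) \le \frac{16\sigma}{\sqrt{n}}\Big(\frac{\sigma}{32\sqrt{n}} + \frac{\sigma}{32\sqrt{n}}\Big) = \frac{\sigma^2}{n} < \lambda\gamma_1,$$

and consequently

$$T_1 + T_2 \ge T_1 = \sin^2(\alpha)(R - 2B)^2 = \frac{\sigma}{32\sqrt{n}R}\Big(R - \frac{8\sigma}{\sqrt{n}}\Big)^2 \ge \frac{\sigma R}{128\sqrt{n}}, \qquad (26a)$$

where the last inequality holds since we have assumed that $R \ge 8\sigma/\sqrt{n}$.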

Case 2: Otherwise, we may assume that $\lambda\gamma_1 \le \sigma^2/n$. In this case, we have

$$T_1 + T_2 \ge T_2 \ge \sum_{i=2}^{n/2} \mathbb{I}\big(B/2 \le w'_i \le B\big) \frac{3\sigma^2}{n}. \qquad (26b)$$


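Combining the two lower bounds (26a) and (26b), we find

$$\mathbb{E}\Big[\inf_{\lambda \ge 0} \sup_{\theta \in \hat{\Theta}_\lambda} \frac{1}{n}\|X\theta - X\theta^*\|_2^2\Big] \ge \mathbb{P}[\mathcal{E}] \; \underbrace{\mathbb{E}\Big[\min\Big\{\frac{\sigma R}{128\sqrt{n}}, \sum_{i=2}^{n/2} \mathbb{I}\big[B/2 \le w'_i \le B\big] \frac{3\sigma^2}{n}\Big\}\Big]}_{T_3},$$

where we have used the fact that $\{w'_i\}_{i=2}^{n/2}$ are independent of the event $\mathcal{E}$. Using the inequality
$\min\{a, \sum_{i=2}^{n/2} b_i\} \ge \sum_{i=2}^{n/2} \min\{2a/n, b_i\}$, valid for non-negative scalars $a$ and $\{b_i\}_{i=2}^{n/2}$, we see that

$$T_3 \ge \sum_{i=2}^{n/2} \mathbb{P}\Big[\frac{2\sigma}{\sqrt{n}} \le w'_i \le \frac{4\sigma}{\sqrt{n}}\Big] \min\Big\{\frac{\sigma R}{128\, n\sqrt{n}}, \frac{3\sigma^2}{n}\Big\},$$

where we have used the fact that $\mathbb{E}\big[\mathbb{I}[B/2 \le w'_i \le B]\big] = \mathbb{P}[B/2 \le w'_i \le B]$, and the definition
$B := \frac{4\sigma}{\sqrt{n}}$.

Since $w'_i \sim N(0, \sigma^2/n)$, the probability $\mathbb{P}[2\sigma/\sqrt{n} \le w'_i \le 4\sigma/\sqrt{n}]$ is bounded away from zero
independently of all problem parameters. Hence, there is a universal constant $c_2 > 0$ such that
$T_3 \ge c_2 \min\{\frac{\sigma R}{\sqrt{n}}, \sigma^2\}$. Putting together the pieces, we have shown that

$$\mathbb{E}\Big[\inf_{\lambda \ge 0} \sup_{\theta \in \hat{\Theta}_\lambda} \frac{1}{n}\|X\theta - X\theta^*\|_2^2\Big] \ge \mathbb{P}[\mathcal{E}]\, T_3 \ge c \min\Big\{\frac{\sigma R}{\sqrt{n}}, \sigma^2\Big\},$$

which completes the proof of the theorem.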
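
The claim that $\mathbb{P}[2\sigma/\sqrt{n} \le w'_i \le 4\sigma/\sqrt{n}]$ does not depend on the problem parameters can be made explicit: after standardizing $w'_i \sim N(0, \sigma^2/n)$, the probability is $\Phi(4) - \Phi(2)$. The short sketch below evaluates it with the standard library; the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def band_probability(sigma, n):
    """P[2*sigma/sqrt(n) <= w <= 4*sigma/sqrt(n)] for w ~ N(0, sigma^2 / n)."""
    std = sigma / math.sqrt(n)
    lower = 2.0 * sigma / math.sqrt(n)
    upper = 4.0 * sigma / math.sqrt(n)
    return normal_cdf(upper / std) - normal_cdf(lower / std)


p_small = band_probability(sigma=1.0, n=10)
p_large = band_probability(sigma=3.5, n=10_000)
```

Both calls return the same value, roughly 0.0227, regardless of the values of $\sigma$ and $n$.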


4.2 Proof of Corollary 1 

Here we provide a detailed proof of inequality (18a). We note that inequality (18b) follows by an 
essentially identical series of steps, so that we omit the details. 

Let $m$ be an even integer and let $X_m \in \mathbb{R}^{m \times m}$ denote the design matrix constructed in the
proof of Theorem 1. In order to avoid confusion, we rename the parameters $(n, d, R)$ in the
construction (21b) by $(n', d', R')$, and set them equal to

$$(n', d', R') := \Big(m, \; m, \; \min\Big\{\frac{R\sqrt{n}}{k\sqrt{m}}, \frac{\sigma}{16\gamma^2\sqrt{m}}\Big\}\Big), \qquad (27)$$

where the quantities $(k, m, n, R, \sigma)$ are defined in the statement of Corollary 1. Note that $X_m$
is a square matrix, and according to equation (21b), all of its eigenvalues are lower bounded by
$\big(\frac{\sqrt{m}\,\sigma}{16R'}\big)^{1/2}$. By equation (27), this quantity is lower bounded by $\sqrt{m}\gamma$.

Using the matrix $X_m$ as a building block, we now construct a larger design matrix $X \in \mathbb{R}^{n \times n}$
that we then use to prove the corollary. Let $m$ be the greatest integer divisible by two such that
$km \le n$. By the assumption that $n \ge 4k$, we have $m \ge 4$. Consequently, we may construct the
$n \times n$ dimensional matrix

$$X := \mathrm{blkdiag}\big\{\underbrace{\sqrt{n/m}\,X_m, \ldots, \sqrt{n/m}\,X_m}_{k \text{ copies}}, \; \sqrt{n}\,I_{n-km}\big\}, \qquad (28)$$

where $I_{n-km}$ is the $(n - km)$-dimensional identity matrix. It is easy to verify the matrix $X$ satisfies
the column normalization condition. Since all eigenvalues of $X_m$ are lower bounded by $\sqrt{m}\gamma$, we
are guaranteed that all eigenvalues of $X$ are lower bounded by $\sqrt{n}\gamma$. Thus, the matrix $X$ satisfies
the $\gamma$-RE condition.
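
Lower bounds of this kind rest on the spectrum of the two-by-two block $\sqrt{n}A$ from (21a): since $A^\top A$ has diagonal entries 1 and off-diagonal entries $-\cos(2\alpha)$, the singular values of $\sqrt{n}A$ are $\sqrt{2n}\cos\alpha$ and $\sqrt{2n}\sin\alpha$. The sketch below checks this numerically for one illustrative choice of $n$ and $\alpha$ (the values 64 and 0.3 are arbitrary assumptions, not the ones used in the proof).

```python
import numpy as np

n, alpha = 64, 0.3  # illustrative values with 0 < alpha < pi / 4

A = np.array([[np.cos(alpha), -np.cos(alpha)],
              [np.sin(alpha), np.sin(alpha)]])

# Singular values of the scaled block, returned in decreasing order.
singular_values = np.linalg.svd(np.sqrt(n) * A, compute_uv=False)

# Closed form: sqrt(2n) * cos(alpha) >= sqrt(2n) * sin(alpha) for alpha < pi / 4.
expected = np.sqrt(2 * n) * np.array([np.cos(alpha), np.sin(alpha)])
```

The smaller singular value, $\sqrt{2n}\sin\alpha$, is the quantity that controls how small $\|X\theta\|_2$ can be relative to $\|\theta\|_2$ on each block.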

It remains to prove a lower bound on the prediction error, and in order to do so, it is helpful
to introduce some shorthand notation. Given an arbitrary vector $u \in \mathbb{R}^n$, for each integer $i \in \{1, \ldots, k\}$,
we let $u_{(i)} \in \mathbb{R}^m$ denote the sub-vector consisting of the $((i-1)m + 1)$-th to the $(im)$-th
elements of vector $u$, and we let $u_{(k+1)}$ denote the sub-vector consisting of the last $n - km$ elements.
We also introduce similar notation for the function $\rho(x) = \rho_1(x_1) + \cdots + \rho_n(x_n)$; specifically, for
each $i \in \{1, \ldots, k\}$, we define the function $\rho_{(i)} : \mathbb{R}^m \to \mathbb{R}$ via $\rho_{(i)}(\theta) := \sum_{j=1}^{m} \rho_{(i-1)m+j}(\theta_j)$.

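Using this notation, we may rewrite the cost function as:

$$L(\theta; \lambda) = \frac{1}{n}\sum_{i=1}^{k}\Big(\big\|\sqrt{n/m}\,X_m\theta_{(i)} - y_{(i)}\big\|_2^2 + n\lambda\rho_{(i)}(\theta_{(i)})\Big) + h(\theta_{(k+1)}),$$

where $h$ is a function that only depends on $\theta_{(k+1)}$. If we define $\theta'_{(i)} := \sqrt{n/m}\,\theta_{(i)}$ and
$\rho'_{(i)}(\theta) := \frac{n}{m}\rho_{(i)}(\sqrt{m/n}\,\theta)$, then substituting them into the above expression, the cost function can be rewritten
as

$$G(\theta'; \lambda) := \frac{m}{n}\sum_{i=1}^{k}\Big(\frac{1}{m}\big\|X_m\theta'_{(i)} - y_{(i)}\big\|_2^2 + \lambda\rho'_{(i)}(\theta'_{(i)})\Big) + h\big(\sqrt{m/n}\,\theta'_{(k+1)}\big).$$

Note that if the vector $\hat{\theta}$ is a local minimum of the function $\theta \mapsto L(\theta; \lambda)$, then the rescaled vector
$\hat{\theta}' := \sqrt{n/m}\,\hat{\theta}$ is a local minimum of the function $\theta' \mapsto G(\theta'; \lambda)$. Consequently, the sub-vector $\hat{\theta}'_{(i)}$
must be a local minimum of the function

$$\frac{1}{m}\big\|X_m\theta'_{(i)} - y_{(i)}\big\|_2^2 + \lambda\rho'_{(i)}(\theta'_{(i)}). \qquad (29)$$

Thus, the sub-vector $\hat{\theta}'_{(i)}$ is the solution of a regularized sparse linear regression problem with design
matrix $X_m$.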

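Defining the rescaled true regression vector $(\theta^*)' := \sqrt{n/m}\,\theta^*$, we can then write the prediction
error as

$$\frac{1}{n}\|X(\hat{\theta} - \theta^*)\|_2^2 = \frac{1}{n}\sum_{i=1}^{k}\big\|X_m(\hat{\theta}'_{(i)} - (\theta^*)'_{(i)})\big\|_2^2 + \big\|\hat{\theta}_{(k+1)} - \theta^*_{(k+1)}\big\|_2^2 \ge \frac{m}{n}\sum_{i=1}^{k}\frac{1}{m}\big\|X_m(\hat{\theta}'_{(i)} - (\theta^*)'_{(i)})\big\|_2^2. \qquad (30)$$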

Consequently, the overall prediction error is lower bounded by a scaled sum of the prediction errors
associated with the design matrix $X_m$. Moreover, each term $\frac{1}{m}\|X_m(\hat{\theta}'_{(i)} - (\theta^*)'_{(i)})\|_2^2$ can be bounded
by Theorem 1.

More precisely, let $Q(X, 2k, R)$ denote the left-hand side of inequality (18a). The above analysis
shows that the sparse linear regression problem on the design matrix $X$ and the constraint
$\theta^* \in \mathbb{B}_0(2k) \cap \mathbb{B}_1(R)$ can be decomposed into smaller-scale problems on the design matrix $X_m$ and
constraints on the scaled vector $(\theta^*)'$. By the rescaled definition of $(\theta^*)'$, the constraint
$\theta^* \in \mathbb{B}_0(2k) \cap \mathbb{B}_1(R)$ holds if and only if $(\theta^*)' \in \mathbb{B}_0(2k) \cap \mathbb{B}_1(\sqrt{n/m}\,R)$. Recalling the definition of the
radius $R'$ from equation (27), we can ensure that $(\theta^*)' \in \mathbb{B}_0(2k) \cap \mathbb{B}_1(\sqrt{n/m}\,R)$ by requiring that
$(\theta^*)'_{(i)} \in \mathbb{B}_0(2) \cap \mathbb{B}_1(R')$ for each index $i \in \{1, \ldots, k\}$. Combining expressions (29) and (30), the
quantity $Q(X, 2k, R)$ can be lower bounded by the sum

$$Q(X, 2k, R) \ge \frac{m}{n}\sum_{i=1}^{k} Q(X_m, 2, R'). \qquad (31a)$$

By Theorem 1, we have

$$Q(X_m, 2, R') \ge c\min\Big\{\sigma^2, \frac{\sigma R'}{\sqrt{m}}\Big\} = c\min\Big\{\sigma^2, \frac{\sigma^2}{16\gamma^2 m}, \frac{\sigma R\sqrt{n}}{km}\Big\}, \qquad (31b)$$

where the equality follows from our choice of $R'$ from equation (27). Combining the lower
bounds (31a) and (31b) completes the proof.

4.3 Proof of Theorem 2

The proof of Theorem 2 is conceptually similar to the proof of Theorem 1, but differs in some key
details. We begin with the altered definitions

$$\alpha := \arcsin\Big(\frac{\sqrt{\sigma}}{n^{1/4}\sqrt{r}}\Big) \quad \text{and} \quad B := \frac{\sigma}{4\sqrt{n}}, \quad \text{where } r := \min\{R, \sigma\}.$$

Given our assumption $R \ge \sigma/\sqrt{n}$, note that we are guaranteed that the inequality $2B = \sigma/(2\sqrt{n}) \le r/2$
holds. We then define the matrix $A \in \mathbb{R}^{2 \times 2}$ and the matrix $X \in \mathbb{R}^{n \times d}$ by equations (21a) and (21b).

4.3.1 Proof of part (a)

Let $\{\theta^t\}_{t=0}^{\infty}$ be the sequence of iterates generated by equation (19). We proceed via proof by
contradiction, assuming that the sequence does not terminate finitely, and then deriving a contradiction.

We begin with a lemma.

Lemma 2. If the sequence of iterates $\{\theta^t\}_{t=0}^{\infty}$ is not finitely convergent, then it is unbounded.

We defer the proof of this claim to the end of this section. Based on Lemma 2, it suffices to show
that, in fact, the sequence $\{\theta^t\}_{t=0}^{\infty}$ is bounded. Partitioning the full vector as $\theta := (\theta_{1:n}, \theta_{n+1:d})$, we
control the two sequences $\{\theta^t_{1:n}\}_{t=0}^{\infty}$ and $\{\theta^t_{n+1:d}\}_{t=0}^{\infty}$.

Beginning with the former sequence, notice that the objective function can be written in the
form

$$L(\theta; \lambda) = \frac{1}{n}\|y - X_{1:n}\theta_{1:n}\|_2^2 + \lambda\sum_{i=1}^{d}\rho_i(\theta_i),$$

where $X_{1:n}$ represents the first $n$ columns of matrix $X$. The conditions (21a) and (21b) guarantee
that the Gram matrix $X_{1:n}^\top X_{1:n}$ is positive definite, which implies that the quadratic function
$\theta_{1:n} \mapsto \|y - X_{1:n}\theta_{1:n}\|_2^2$ is strongly convex. Thus, if the sequence $\{\theta^t_{1:n}\}_{t=0}^{\infty}$ were unbounded, then
the associated cost sequence $\{L(\theta^t; \lambda)\}_{t=0}^{\infty}$ would also be unbounded. But this is not possible since
$L(\theta^t; \lambda) \le L(\theta^0; \lambda)$ for all iterations $t = 1, 2, \ldots$. Consequently, we are guaranteed that the sequence
$\{\theta^t_{1:n}\}_{t=0}^{\infty}$ must be bounded.

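It remains to control the sequence $\{\theta^t_{n+1:d}\}_{t=0}^{\infty}$. We claim that for any $i \in \{n+1, \ldots, d\}$, the
sequence $\{|\theta^t_i|\}_{t=0}^{\infty}$ is non-increasing, which implies the boundedness condition. Proceeding via proof
by contradiction, suppose that $|\theta^t_i| < |\theta^{t+1}_i|$ for some index $i \in \{n+1, \ldots, d\}$ and iteration number
$t \ge 0$. Under this condition, define the vector

$$\tilde{\theta}^{t+1}_j := \begin{cases} \theta^{t+1}_j & \text{if } j \ne i \\ \theta^t_i & \text{if } j = i. \end{cases}$$

Because the last $d - n$ columns of $X$ are zero, the coordinate $\theta_i$ enters the objective only through
the regularizer. Since $\rho_j$ is a monotonically non-decreasing function of $|x|$, we are guaranteed that
$L(\tilde{\theta}^{t+1}; \lambda) \le L(\theta^{t+1}; \lambda)$, which implies that $\tilde{\theta}^{t+1}$ is also a constrained minimum point over the ball
$\mathbb{B}_2(\eta; \theta^t)$. In addition, we have

$$\|\tilde{\theta}^{t+1} - \theta^t\|_2^2 = \|\theta^{t+1} - \theta^t\|_2^2 - |\theta^t_i - \theta^{t+1}_i|^2 < \eta^2,$$

so that $\tilde{\theta}^{t+1}$ is strictly closer to $\theta^t$. This contradicts the specification of the algorithm, in that it
chooses the minimum closest to $\theta^t$.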

Proof of Lemma 2: The final remaining step is to prove Lemma 2. We first claim that
$\|\theta^s - \theta^t\|_2 \ge \eta$ for all pairs $s < t$. If not, we could find some pair $s < t$ such that $\|\theta^s - \theta^t\|_2 < \eta$.
But since $t > s$, we are guaranteed that $L(\theta^t; \lambda) \le L(\theta^{s+1}; \lambda)$. Since $\theta^{s+1}$ is a global minimum over
the ball $\mathbb{B}_2(\eta; \theta^s)$ and $\|\theta^s - \theta^t\|_2 < \eta$, the point $\theta^t$ is also a global minimum, and this contradicts
the definition of the algorithm (since it always chooses the constrained global minimum closest to
the current iterate).

Using this property, we now show that $\{\theta^t\}_{t=0}^{\infty}$ is unbounded. For each iteration $t = 0, 1, 2, \ldots$, we
use $\mathbb{B}^t = \mathbb{B}_2(\eta/3; \theta^t)$ to denote the Euclidean ball of radius $\eta/3$ centered at $\theta^t$. Since $\|\theta^s - \theta^t\|_2 \ge \eta$
for all $s \ne t$, the balls $\{\mathbb{B}^t\}_{t=0}^{\infty}$ are all disjoint, and hence there is a numerical constant $C > 0$ such
that for each $T \ge 1$, we have

$$\mathrm{vol}\Big(\bigcup_{t=0}^{T}\mathbb{B}^t\Big) = \sum_{t=0}^{T}\mathrm{vol}(\mathbb{B}^t) = C\sum_{t=0}^{T}\eta^d.$$

Since this volume diverges as $T \to \infty$, we conclude that the set $\mathbb{B} := \bigcup_{t=0}^{\infty}\mathbb{B}^t$ must be unbounded.
By construction, any point in $\mathbb{B}$ is within $\eta/3$ of some element of the sequence $\{\theta^t\}_{t=0}^{\infty}$, so this
sequence must be unbounded, as claimed.

4.3.2 Proof of part (b)

We now prove a lower bound on the prediction error corresponding to the local minimum to which
the algorithm converges, as claimed in part (b) of the theorem statement. In order to do so, we
begin by introducing the shorthand notation

$$\gamma_i = \min\Big\{\sup_{u \in (0, B]}\rho'_{2i-1}(u), \; \sup_{u \in (0, B]}\rho'_{2i}(u)\Big\} \quad \text{for each } i = 1, \ldots, n/2. \qquad (32)$$

Then we define the quantities $a_i$ and $w'_i$ by equations (24). Similar to the proof of Theorem 1, we
assume (without loss of generality, re-indexing as needed) that $\gamma_i = \sup_{u \in (0, B]}\rho'_{2i-1}(u)$ and that
$\gamma_1 = \max_{i \in [n/2]}\{\gamma_i\}$.

Consider the regression vector $\theta^* := [\tfrac{r}{2} \;\; \tfrac{r}{2} \;\; 0 \;\cdots\; 0]$. Since the matrix $A$ appears in diagonal
blocks of $X$, the algorithm’s output $\hat{\theta}$ has error

$$\inf_{\lambda \ge 0}\frac{1}{n}\|X\hat{\theta} - X\theta^*\|_2^2 = \inf_{\lambda \ge 0}\sum_{i=1}^{n/2}\big\|A\hat{\theta}_{(2i-1):2i} - A\theta^*_{(2i-1):2i}\big\|_2^2. \qquad (33)$$

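Given the random initialization $\theta^0$, we define the events

$$\mathcal{E}_0 := \big\{\max\{\theta^0_1, \theta^0_2\} \le 0\big\} \quad \text{and} \quad \mathcal{E}_1 := \Big\{\lambda\gamma_1 > 2\sin^2(\alpha)r + \frac{2\|w_{1:2}\|_2}{\sqrt{n}} + 3B\Big\},$$

as well as the (random) subsets

$$\mathbb{S}_1 := \big\{i \in \{2, \ldots, n/2\} \mid \lambda\gamma_1 \le 2w'_i - 4B\big\}, \quad \text{and}$$

$$\mathbb{S}_2 := \Big\{i \in \{2, \ldots, n/2\} \;\Big|\; 2\sin^2(\alpha)r + \frac{2\|w_{1:2}\|_2}{\sqrt{n}} + 3B \le 2w'_i - 4B\Big\}.$$

Here the reader should recall the definition of $w'_i$ from equation (24).

Given these definitions, the following lemma provides lower bounds on the decomposition (33)
for the vector $\hat{\theta}$ after convergence.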

Lemma 3. (a) If $\mathcal{E}_0 \cap \mathcal{E}_1$ holds, then $\|A(\hat{\theta}_{1:2} - \theta^*_{1:2})\|_2^2 \ge \frac{\sigma r}{4\sqrt{n}}$.

(b) For any index i € § 1 , we have || A (9 2i -i-2i ~ 0 2 i-i : 2 i) 111 > ^772 • 

(c) We have $\mathbb{P}[\mathcal{E}_0] = 1/4$, and moreover $\min_{i \in \{2, \ldots, n/2\}}\mathbb{P}[i \in \mathbb{S}_2] \ge c$ for some numerical constant $c > 0$.

See Appendix C for the proof of this claim. 


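Conditioned on event $\mathcal{E}_0$, for any index $i \in \mathbb{S}_2$, either the event $\mathcal{E}_0 \cap \mathcal{E}_1$ holds, or we have

$$\lambda\gamma_1 \le 2\sin^2(\alpha)r + \frac{2\|w_{1:2}\|_2}{\sqrt{n}} + 3B \le 2w'_i - 4B,$$

which means that $i \in \mathbb{S}_1$ holds. Applying Lemma 3 yields the lower bound

$$\inf_{\lambda \ge 0}\sum_{i=1}^{n/2}\big\|A\hat{\theta}_{(2i-1):2i} - A\theta^*_{(2i-1):2i}\big\|_2^2 \ge \mathbb{I}[\mathcal{E}_0]\min\Big\{\frac{\sigma r}{4\sqrt{n}}, \; \frac{\sigma r}{8n^{3/2}}\sum_{i=2}^{n/2}\mathbb{I}[i \in \mathbb{S}_2]\Big\} = \mathbb{I}[\mathcal{E}_0]\,\frac{\sigma r}{8n^{3/2}}\sum_{i=2}^{n/2}\mathbb{I}[i \in \mathbb{S}_2],$$

where the last equality holds since $\frac{\sigma r}{8n^{3/2}}\sum_{i=2}^{n/2}\mathbb{I}[i \in \mathbb{S}_2] \le \frac{\sigma r}{8n^{3/2}}(n/2 - 1) \le \frac{\sigma r}{4\sqrt{n}}$. Since the event
$\mathcal{E}_0$ is independent of the events $\{i \in \mathbb{S}_2\}$, $i = 2, \ldots, n/2$, we have

$$\mathbb{E}\Big[\inf_{\lambda \ge 0}\sum_{i=1}^{n/2}\big\|A\hat{\theta}_{(2i-1):2i} - A\theta^*_{(2i-1):2i}\big\|_2^2\Big] \ge \mathbb{P}[\mathcal{E}_0]\,\frac{\sigma r}{8n^{3/2}}\sum_{i=2}^{n/2}\mathbb{P}[i \in \mathbb{S}_2] \overset{(i)}{\ge} \frac{1}{4}\cdot\frac{\sigma r}{8n^{3/2}}\cdot c\,(n/2 - 1) \ge c'\min\{R, \sigma\}\frac{\sigma}{\sqrt{n}},$$

where step (i) uses the lower bound $\mathbb{P}[\mathcal{E}_0] = 1/4$ and $\mathbb{P}[i \in \mathbb{S}_2] \ge c$ from Lemma 3. Combined with
the decomposition (33), the proof is complete.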


5 Discussion 

In this paper, we have demonstrated a fundamental gap in sparse linear regression: the best
prediction risk achieved by a class of M-estimators based on coordinate-wise separable regularizers is
strictly larger than the classical minimax prediction risk, achieved for instance by minimization
over the $\ell_0$-ball. This gap applies to a range of methods used in practice, including the Lasso in its
ordinary and weighted forms, as well as estimators based on nonconvex penalties such as the MCP
and SCAD penalties.

Several open questions remain, and we discuss a few of them here. When the penalty function
$\rho$ is convex, the M-estimator minimizing function (9) can be understood as a particular convex
relaxation of the $\ell_0$-based estimator (3). It would be interesting to consider other forms of convex
relaxations for the $\ell_0$-based problem. For instance, Pilanci et al. [27] show how a broad class of
$\ell_0$-regularized problems can be reformulated exactly as optimization problems involving convex
functions in Boolean variables. This exact reformulation allows for the direct application of many
standard hierarchies for Boolean polynomial programming, including the Lasserre hierarchy [21]
as well as the Sherali-Adams hierarchy [29]. Other relaxations are possible, including those that
are based on introducing auxiliary variables for the pairwise interactions (e.g., $\gamma_{ij} = \theta_i\theta_j$), and so
incorporating these constraints as polynomials in the constraint set. We conjecture that for any
fixed natural number $t$, if the $t$-th level Lasserre (or Sherali-Adams) relaxation is applied to
such a reformulation, it still does not yield an estimator that achieves the fast rate (4). Since a
$t$-th level relaxation involves $O(d^t)$ variables, this would imply that these hierarchies do not contain
polynomial-time algorithms that achieve the classical minimax risk. Proving or disproving this
conjecture remains an open problem.

Finally, when the penalty function $\rho$ is concave, concurrent work by Ge et al. [17] shows that
finding the global minimum of the loss function (9) is strongly NP-hard. This result implies that no
polynomial-time algorithm computes the global minimum unless NP = P. The result given here
is complementary in nature: it shows that bad local minima exist, and that local descent methods
converge to these bad local minima. It would be interesting to extend this algorithmic lower bound
to a broader class of first-order methods. For instance, we suspect that any algorithm that relies
on an oracle giving first-order information will inevitably converge to a bad local minimum for a
broad class of random initializations.

A Fast rate for the bad example of Dalalyan et al. [12]

In this appendix, we describe the bad example of Dalalyan et al. [12], and show that a reweighted
form of the Lasso achieves the fast rate. For a given sample size $n \ge 4$, they consider a linear
regression model $y = X\theta^* + w$, where $X \in \mathbb{R}^{n \times 2m}$ with $m = n - 1$, and the noise vector $w \in \{-1, 1\}^n$
has i.i.d. Rademacher entries (equiprobably chosen in $\{-1, 1\}$). In the construction, the true vector
$\theta^*$ is 2-sparse, and the design matrix $X \in \mathbb{R}^{n \times 2m}$ is given by

$$X = \sqrt{n}\begin{bmatrix} \mathbf{1}_m^\top & \mathbf{1}_m^\top \\ I_{m \times m} & -I_{m \times m} \end{bmatrix},$$

where $\mathbf{1}_m \in \mathbb{R}^m$ is a vector of all ones. Notice that this construction has $n = m + 1$.

In this appendix, we analyze the performance of the following estimator

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^{2m}} \frac{1}{n}\|X\theta - y\|_2^2 + \lambda\sum_{i=2}^{m}\big(|\theta_i| + |\theta_{m+i}|\big). \qquad (34)$$

It is a reweighted form of the Lasso based on $\ell_1$-norm regularization, but one that imposes no
constraint on the first and the $(m+1)$-th coordinate. We claim that with an appropriate choice of
$\lambda$, this estimator achieves the fast rate for any 2-sparse vector $\theta^*$.

Letting $\hat{\theta}$ be a minimizer of function (34), we first observe that no matter what value it attains,
the minimizer always chooses $\hat{\theta}_1$ and $\hat{\theta}_{m+1}$ so that $(X\hat{\theta})_{1:2} = y_{1:2}$. This property occurs because:

• There is no penalty term associated with $\theta_1$ and $\theta_{m+1}$.

• By the definition of $X$, changes in the coordinates $\theta_1$ and $\theta_{m+1}$ only affect the first two
coordinates of $X\theta$ by the additive term

$$\sqrt{n}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} \theta_1 \\ \theta_{m+1} \end{bmatrix}.$$

Since the above 2-by-2 matrix is non-singular, there is always an assignment to $(\theta_1, \theta_{m+1})$ so
that $(X\theta)_{1:2} - y_{1:2} = 0$.

Thus, only the last $n - 2$ coordinates of $X\hat{\theta} - y$ might be non-zero, so that we may rewrite the
objective function (34) as

$$\frac{1}{n}\|X\theta - y\|_2^2 + \lambda\sum_{i=2}^{m}\big(|\theta_i| + |\theta_{m+i}|\big) = \sum_{i=2}^{m}\Big(\frac{1}{n}\big(\sqrt{n}\theta_i - \sqrt{n}\theta_{m+i} - y_i\big)^2 + \lambda\big(|\theta_i| + |\theta_{m+i}|\big)\Big). \qquad (35)$$

The function (35) is not strictly convex so that there are multiple equivalent solutions. Essentially,
we need to break symmetry by choosing to vary one of $\theta_i$ or $\theta_{m+i}$, for each $i \in \{2, \ldots, m\}$. Without
loss of generality, we assume that $\theta_{m+2} = \theta_{m+3} = \cdots = \theta_{2m} = 0$, so that the equation is simplified
as

$$\frac{1}{n}\|X\theta - y\|_2^2 + \lambda\sum_{i=2}^{m}\big(|\theta_i| + |\theta_{m+i}|\big) = \sum_{i=2}^{m}\Big(\frac{1}{n}\big(\sqrt{n}\theta_i - y_i\big)^2 + \lambda|\theta_i|\Big). \qquad (36)$$

Moreover, with this choice, we can write the prediction error as

$$R(\hat{\theta}) := \frac{1}{n}\|X\hat{\theta} - X\theta^*\|_2^2 = \frac{2}{n} + \frac{1}{n}\sum_{i=2}^{m}\big(\sqrt{n}\hat{\theta}_i - \sqrt{n}(\theta^*_i - \theta^*_{m+i})\big)^2. \qquad (37)$$

The first term on the right-hand side is obtained from the fact $\|(X\hat{\theta} - X\theta^*)_{1:2}\|_2^2 = \|w_{1:2}\|_2^2 = 2$,
recalling that the set-up assumes that the noise elements take values in $\{-1, 1\}$.

The right-hand side of equation (36) is a Lasso objective function with design matrix $\sqrt{n}\,I_{(m-1) \times (m-1)}$.
The second term on the right-hand side of equation (37) is the associated prediction error. By choosing
a proper $\lambda$ and using the fact that $\theta^*$ is 2-sparse, it is well-known that the prediction error scales
as $O(\frac{\log n}{n})$, which corresponds to the fast rate. (Here we have recalled that the dimension of the Lasso
problem is $m - 1 = n - 2$.)


B Proof of Lemma 1 


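Given our definition of $X$ in terms of the matrix $A \in \mathbb{R}^{2 \times 2}$, it suffices to prove the two lower bounds

$$\|A(\hat{\theta}_\lambda)_{1:2} - A\theta^*_{1:2}\|_2^2 \ge \mathbb{I}\Big[\lambda\gamma_1 > 4B\Big(\sin^2(\alpha)R + \frac{\|w_{1:2}\|_2}{\sqrt{n}}\Big)\Big]\sin^2(\alpha)(R - 2B)_+^2, \quad \text{and} \qquad (38a)$$

$$\|A(\hat{\theta}_\lambda)_{2i-1:2i} - A\theta^*_{2i-1:2i}\|_2^2 \ge \mathbb{I}\big[B/2 \le w'_i \le B\big]\Big(\frac{B^2}{4} - \lambda\gamma_1\Big)_+ \quad \text{for } i = 2, 3, \ldots, n/2. \qquad (38b)$$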


In the proofs to follow, it is convenient to omit reference to the index $i$. In particular, viewing
the index $i$ as fixed a priori, we let $\hat{u}$ and $u^*$ be shorthand representations of the sub-vectors
$(\hat{\theta}_\lambda)_{2i-1,2i}$ and $\theta^*_{2i-1,2i}$, respectively. We introduce the normalized noise $\varepsilon := w_{2i-1,2i}/\sqrt{n}$. By
our construction of the design matrix $X$ in terms of $A$, the vector $\hat{\theta}_\lambda$ is a local minimizer of the
objective function if and only if $\hat{u}$ is a local minimum of the following loss:

$$\ell(u; \lambda) := \|Au - Au^* - \varepsilon\|_2^2 + \lambda\rho_{2i-1}(u_1) + \lambda\rho_{2i}(u_2),$$

where this statement should hold for each $i \in [n/2]$. Hence, it suffices to find a local minimum of
$\ell(u; \lambda)$ such that the bounds (38a) and (38b) hold.

B.l Proof of inequality (38a) 

If $\lambda\gamma_1 \le 4B(\sin^2(\alpha)R + \|\varepsilon\|_2)$, then the lower bound (38a) is trivial, and in particular, it will hold
for

$$\hat{u} := \arg\min_{u}\ell(u; \lambda).$$

Otherwise, we may assume that $\lambda\gamma_1 > 4B(\sin^2(\alpha)R + \|\varepsilon\|_2)$. In this case ($i = 1$), we have
$u^* = (R/2, R/2)$. Defining the vectors $v^* := Au^* = (0, \sin(\alpha)R)$ and $\tilde{u} := (0, 0)$, we have

$$\ell(\tilde{u}; \lambda) = \|A\tilde{u} - v^* - \varepsilon\|_2^2 + \lambda\rho_1(\tilde{u}_1) + \lambda\rho_2(\tilde{u}_2) = \|v^* + \varepsilon\|_2^2. \qquad (39)$$

We claim that

$$\inf_{u \in \partial U}\ell(u; \lambda) > \ell(\tilde{u}; \lambda), \qquad (40)$$

where $U := \{u \in \mathbb{R}^2 \mid |u_1| \le B \text{ and } |u_2| \le B\}$, and $\partial U$ denotes its boundary. If $\rho$ is a convex
function, then the lower bound (40) implies that any minimizers of the function $\ell(\cdot; \lambda)$ lie in the
interior of $U$. Otherwise, it implies that at least one local minimum, say $\hat{u}$, lies in the interior of
$U$. Since $\hat{u}_1 \le B$ and $\hat{u}_2 \le B$, we have the lower bound

$$\|A(\hat{\theta}_\lambda)_{1:2} - A\theta^*_{1:2}\|_2^2 = \|A\hat{u} - v^*\|_2^2 = \cos^2(\alpha)(\hat{u}_1 - \hat{u}_2)^2 + \sin^2(\alpha)(R - \hat{u}_1 - \hat{u}_2)^2 \ge \sin^2(\alpha)(R - \hat{u}_1 - \hat{u}_2)^2 \ge \sin^2(\alpha)(R - 2B)_+^2,$$

which completes the proof.

It remains to prove the lower bound (40). For any $u \in \partial U$, we have

$$\ell(u; \lambda) = \|Au - v^* - \varepsilon\|_2^2 + \lambda\rho_1(u_1) + \lambda\rho_2(u_2) \overset{(i)}{\ge} \|Au - v^* - \varepsilon\|_2^2 + \lambda\gamma_1 \overset{(ii)}{\ge} \|v^* + \varepsilon\|_2^2 - 2(v^* + \varepsilon)^\top Au + \lambda\gamma_1. \qquad (41)$$

Inequality (i) holds since either $|u_1|$ or $|u_2|$ is equal to $B$, and $\min\{\rho_1(B), \rho_2(B)\} \ge \gamma_1$ by definition,
whereas inequality (ii) holds since $\|a + b\|_2^2 \ge \|b\|_2^2 + 2b^\top a$. We notice that

$$\sup_{u \in \partial U} 2(v^* + \varepsilon)^\top Au \le \sup_{u \in \partial U}\big\{2\langle v^*, Au\rangle + 2\|\varepsilon\|_2\|Au\|_2\big\} = \sup_{u \in \partial U}\Big\{2\sin^2(\alpha)R(u_1 + u_2) + 2\|\varepsilon\|_2\sqrt{\cos^2(\alpha)(u_1 - u_2)^2 + \sin^2(\alpha)(u_1 + u_2)^2}\Big\} \le 4B\big(\sin^2(\alpha)R + \|\varepsilon\|_2\big).$$

Combining this bound with inequality (41) and the bound $\lambda\gamma_1 > 4B(\sin^2(\alpha)R + \|\varepsilon\|_2)$ yields
the claim (40).
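
B.2 Proof of inequality (38b)

For $i = 2, 3, \ldots, n/2$, consider $u^* = (0, 0)$ and recall the vector $a_i = (\cos\alpha, \sin\alpha)$, as well as our
assumption that

$$\gamma_i := \rho_{2i-1}(B) \le \rho_{2i}(B).$$

Define the vector $\tilde{u} := (a_i^\top\varepsilon, 0)$ and let $\hat{u} = \arg\min_{u \in \mathbb{R}^2}\ell(u; \lambda)$ be an arbitrary global minimizer.
We then have

$$\|A\hat{u} - \varepsilon\|_2^2 \le \|A\hat{u} - \varepsilon\|_2^2 + \lambda\rho_1(\hat{u}_1) + \lambda\rho_2(\hat{u}_2) \le \|A\tilde{u} - \varepsilon\|_2^2 + \lambda\rho_1(\tilde{u}_1) + \lambda\rho_2(\tilde{u}_2),$$

since the regularizer is non-negative, and $\hat{u}$ is a global minimum. Using the definition of $\tilde{u}$, we find
that

$$\|A\hat{u} - \varepsilon\|_2^2 \le \|a_i a_i^\top\varepsilon - \varepsilon\|_2^2 + \lambda\rho_1(a_i^\top\varepsilon) = \|\varepsilon\|_2^2 - (a_i^\top\varepsilon)^2 + \lambda\rho_1(a_i^\top\varepsilon),$$

where the final equality holds since $a_i a_i^\top$ defines an orthogonal projection. By the triangle inequality,
we find that $\|A\hat{u} - \varepsilon\|_2^2 \ge \|\varepsilon\|_2^2 - \|A\hat{u}\|_2^2$, and combining with the previous inequality yields

$$\|A\hat{u}\|_2^2 \ge (a_i^\top\varepsilon)^2 - \lambda\rho_1(a_i^\top\varepsilon). \qquad (42)$$

Now if $B/2 \le a_i^\top\varepsilon \le B$, then we have $\rho_1(a_i^\top\varepsilon) \le \rho_1(B) = \gamma_i \le \gamma_1$. Substituting this relation into
inequality (42), we have $\|A\hat{u}\|_2^2 \ge \mathbb{I}(B/2 \le a_i^\top\varepsilon \le B)\,(B^2/4 - \lambda\gamma_1)$, which completes the proof.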

C Proof of Lemma 3 

Similar to the proof of Lemma 1, it is convenient to omit reference to the index $i$. We let $u^t$ and
$u^*$ be shorthand representations of the sub-vectors $\theta^t_{2i-1,2i}$ and $\theta^*_{2i-1,2i}$, respectively. We introduce
the normalized noise $\varepsilon := w_{2i-1,2i}/\sqrt{n}$. By our construction of the design matrix $X$ and the update
formula (19), the vector $u^t$ satisfies the recursion

$$u^{t+1} \in \arg\min_{\|u - u^t\|_2 \le \beta}\ell(u; \lambda), \qquad (43a)$$

where $\beta := \|u^{t+1} - u^t\|_2 \le \eta$ and the loss function takes the form

$$\ell(u; \lambda) := \underbrace{\|Au - Au^* - \varepsilon\|_2^2}_{:= T} + \lambda\rho_{2i-1}(u_1) + \lambda\rho_{2i}(u_2). \qquad (43b)$$

This statement holds for each $i \in [n/2]$. Hence, it suffices to study the update formula (43a).

C.1 Proof of part (a)

For the case of $i = 1$, we assume that the event $\mathcal{E}_2$ holds. Consequently, we have $\max\{u_1^0, u_2^0\} \le 0$
and $\lambda\gamma_1 > 2\sin^2(\alpha)\, r + 2\|\epsilon\|_2$. The corresponding regression vector is $u^* = (r/2, r/2)$. Let us define

$$b_1 \in \arg\sup_{u \in (0,B]} \rho'_{2i-1}(u) \quad \text{and} \quad b_2 \in \arg\sup_{u \in (0,B]} \rho'_{2i}(u).$$
Our assumption implies $\rho'_{2i-1}(b_1) \ge \gamma_1$ and $\rho'_{2i}(b_2) \ge \gamma_1$. We claim that

$$u_k^t \le B \quad \text{for } k = 1, 2 \text{ and for all iterations } t = 0, 1, 2, \ldots. \tag{44}$$

If the claim is true, then we have

$$\|A\theta^t_{1:2} - A\theta^*_{1:2}\|_2^2 = \cos^2(\alpha)(u_1^t - u_2^t)^2 + \sin^2(\alpha)(r - u_1^t - u_2^t)^2 \ge \sin^2(\alpha)(r - u_1^t - u_2^t)^2 \ge \sin^2(\alpha)(r - 2B)_+^2 \ge \frac{\sigma r}{4\sqrt{n}},$$

where the final inequality follows from substituting the definition of $\alpha$, and using the fact that
$2B \le r/2$. Thus, it suffices to prove the claim (44).

We prove the claim (44) by induction on the iteration number $t$. It is clearly true for $t = 0$.
Assuming that the claim is true for a specific integer $t \ge 0$, we establish it for the integer $t + 1$. Our
strategy is as follows: suppose that the vector $u^{t+1}$ minimizes the function $\ell(u;\lambda)$ inside the ball
$\{u : \|u - u^t\|_2 \le \beta\}$. Then, the scalar $u_1^{t+1}$ satisfies

$$u_1^{t+1} = \arg\min_{x \,:\, \|(x, u_2^{t+1}) - u^t\|_2 \le \beta} f(x) \quad \text{where } f(x) := \ell((x, u_2^{t+1});\lambda).$$


We now calculate the generalized derivative [11] of the function $f$ at $u_1^{t+1}$. It turns out that

$$\text{either } u_1^{t+1} \le u_1^t \le b_1, \quad \text{or} \quad \partial f(u_1^{t+1}) \cap (-\infty, 0] \ne \emptyset. \tag{45}$$

Otherwise, there is a sufficiently small scalar $\delta > 0$ such that

$$\|(u_1^{t+1} - \delta, u_2^{t+1}) - u^t\|_2 \le \beta \quad \text{and} \quad f(u_1^{t+1} - \delta) < f(u_1^{t+1}),$$

contradicting the fact that $u_1^{t+1}$ is the minimum point. In statement (45), if the first condition is
true, then we have $u_1^{t+1} \le b_1$. We claim that the second condition also implies $u_1^{t+1} \le b_1$.

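To prove the claim, we assume by contradiction that $u_1^{t+1} > b_1$ and $\partial f(u_1^{t+1}) \cap (-\infty, 0] \ne \emptyset$.
Note that the function $f(x)$ is differentiable for $x > 0$. In particular, for $u_1^{t+1} > b_1$, we have

$$f'(u_1^{t+1}) = \frac{\partial T}{\partial u_1}\bigg|_{u_1 = u_1^{t+1}} + \lambda\rho'_{2i-1}(u_1^{t+1}) = -2(\sin^2(\alpha)\, r + a_1^T\epsilon) + 2u_1^{t+1} - 2(1 - 2\sin^2(\alpha))\, u_2^{t+1} + \lambda\rho'_{2i-1}(u_1^{t+1}). \tag{46}$$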
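
The closed form of the smooth part of the derivative in (46) can be compared against a finite-difference approximation. This is a minimal sketch under two assumptions that are inferred from the formulas in this appendix rather than stated in it: the 2x2 block is $A = [[\cos\alpha, -\cos\alpha], [\sin\alpha, \sin\alpha]]$, and $u^* = (r/2, r/2)$. The function names and the test point are hypothetical.

```python
import math


def T(u1, u2, alpha, r, eps):
    """Quadratic term ||A u - A u* - eps||^2 for the assumed 2x2 block A."""
    c, s = math.cos(alpha), math.sin(alpha)
    # With u* = (r/2, r/2): A u* = (0, r sin(alpha)).
    res0 = c * (u1 - u2) - eps[0]
    res1 = s * (u1 + u2 - r) - eps[1]
    return res0 ** 2 + res1 ** 2


def dT_du1_closed_form(u1, u2, alpha, r, eps):
    """Smooth part of (46): derivative of T with respect to u1."""
    c, s = math.cos(alpha), math.sin(alpha)
    a1_eps = c * eps[0] + s * eps[1]  # a_1^T eps with a_1 = (cos, sin)
    return -2.0 * (s * s * r + a1_eps) + 2.0 * u1 - 2.0 * (1.0 - 2.0 * s * s) * u2


def dT_du1_numeric(u1, u2, alpha, r, eps, h=1e-5):
    """Central finite difference of T in its first argument."""
    return (T(u1 + h, u2, alpha, r, eps) - T(u1 - h, u2, alpha, r, eps)) / (2.0 * h)


# Arbitrary illustrative test point.
args = (0.7, -0.2, 0.4, 1.5, (0.3, -0.1))
gap = abs(dT_du1_closed_form(*args) - dT_du1_numeric(*args))
print(gap)
```

Because $T$ is quadratic in $u_1$, the central difference is exact up to rounding, so any visible gap would indicate a sign or coefficient error in the closed form.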


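Since $1 - 2\sin^2(\alpha) \le 1$ and $a_1^T\epsilon \le \|\epsilon\|_2$, and also because $u_1^{t+1} > b_1$ and $u_2^{t+1} \le u_2^t + \beta \le b_2 + \beta$,
equation (46) implies

$$f'(u_1^{t+1}) \ge -2\sin^2(\alpha)\, r - 2\|\epsilon\|_2 + 2(b_1 - b_2 - \beta) + \lambda\rho'_{2i-1}(u_1^{t+1}).$$

Recall that $b_1, b_2 \in [0, B]$, and also using the fact that

$$\rho'_{2i-1}(u_1^{t+1}) \ge \rho'_{2i-1}(b_1) - \beta H \ge \gamma_1 - \beta H, \tag{47}$$

we find that

$$f'(u_1^{t+1}) \ge -2\sin^2(\alpha)\, r - 2\|\epsilon\|_2 - 2(B + \beta) + \lambda(\gamma_1 - \beta H) \ge -2\sin^2(\alpha)\, r - 2\|\epsilon\|_2 - 3B + \lambda\gamma_1. \tag{48}$$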
Here the second inequality follows since $\beta \le \eta \le \min\{B, [\ldots]\}$. Since the inequality $\lambda\gamma_1 >
2\sin^2(\alpha)\, r + 2\|\epsilon\|_2 + 3B$ holds, inequality (48) implies that $f'(u_1^{t+1}) > 0$. But this conclusion contradicts
the assumption that

$$\partial f(u_1^{t+1}) \cap (-\infty, 0] \ne \emptyset.$$

Thus, in both cases, we have $u_1^{t+1} \le b_1$.

The upper bound for $u_2^{t+1}$ can be proved following the same argument. Thus, we have completed
the induction.


C.2 Proof of part (b) 

For the case of $i = 2, 3, \ldots, n/2$, we assume that the event $i \in \mathcal{S}_2$ holds. Consequently, we have
$\lambda\gamma_1 < 2a_1^T\epsilon - 4B$ as well as our assumption that

$$\sup_{u \in (0,B]} \rho'_{2i-1}(u) = \gamma_i \le \gamma_1.$$

The corresponding regression vector is $u^* = (0, 0)$. Let $\hat{u}$ be the stationary point to which the
sequence $\{u^t\}_{t=0}^{\infty}$ converges. We claim that

$$|\hat{u}_1| \ge B \quad \text{or} \quad |\hat{u}_2| \ge B. \tag{49}$$

If the claim is true, then by the inequality $\cos^2(\alpha) \ge \sin^2(\alpha)$ obtained from the definition of $\alpha$, we have

$$\|A\hat{\theta}_{2i-1:2i} - A\theta^*_{2i-1:2i}\|_2^2 = \cos^2(\alpha)(\hat{\theta}_{2i-1} - \hat{\theta}_{2i})^2 + \sin^2(\alpha)(\hat{\theta}_{2i-1} + \hat{\theta}_{2i})^2 \ge 2\sin^2(\alpha)\big((\hat{u}_1)^2 + (\hat{u}_2)^2\big) \ge 2\sin^2(\alpha)\, B^2 = \frac{\sigma^3/r}{8 n^{3/2}} \ge \frac{\sigma r}{8 n^{3/2}},$$

which completes the proof of part (b). Thus, it suffices to prove the claim (49).
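
The first inequality in the display above rests on the identity $\cos^2(\alpha)(x - y)^2 + \sin^2(\alpha)(x + y)^2 - 2\sin^2(\alpha)(x^2 + y^2) = (\cos^2(\alpha) - \sin^2(\alpha))(x - y)^2$, which is non-negative whenever $\cos^2(\alpha) \ge \sin^2(\alpha)$. The sketch below checks this numerically on random points; the function name, the sampling ranges, and the restriction of the angle to $[0, \pi/4]$ are illustrative assumptions.

```python
import math
import random


def block_norm_sq(u1, u2, alpha):
    """cos^2(alpha) (u1 - u2)^2 + sin^2(alpha) (u1 + u2)^2."""
    c, s = math.cos(alpha), math.sin(alpha)
    return c * c * (u1 - u2) ** 2 + s * s * (u1 + u2) ** 2


rng = random.Random(1)
worst_slack = float("inf")
for _ in range(10000):
    alpha = rng.uniform(0.0, math.pi / 4.0)  # guarantees cos^2 >= sin^2
    u1 = rng.uniform(-3.0, 3.0)
    u2 = rng.uniform(-3.0, 3.0)
    lower = 2.0 * math.sin(alpha) ** 2 * (u1 * u1 + u2 * u2)
    worst_slack = min(worst_slack, block_norm_sq(u1, u2, alpha) - lower)

print(worst_slack)  # non-negative up to floating-point rounding
```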

To prove the claim (49), we notice that $\hat{u}$ is a local minimum of the loss function $\ell(\cdot;\lambda)$. Let

$$f(x) := \ell((x, \hat{u}_2);\lambda)$$

be a function that restricts the second argument of the function $\ell(\cdot;\lambda)$ to be $\hat{u}_2$. Since $\hat{u}$ is a local
minimum of the function $\ell(\cdot;\lambda)$, the scalar $\hat{u}_1$ must be a local minimum of $f$. Consequently,
zero must belong to the generalized derivative [11] of $f$, which we write as $0 \in \partial f(\hat{u}_1)$. We
use this fact to prove the claim (49).

Assume by contradiction that the claim (49) does not hold, which means that $|\hat{u}_1| < B$ and
$|\hat{u}_2| < B$. Calculating the generalized derivative of $f$, we have

$$\partial f(\hat{u}_1) = \{-2a_1^T\epsilon + 2\hat{u}_1 - 2(1 - 2\sin^2(\alpha))\hat{u}_2 + \lambda g_1 : g_1 \in \partial\rho_{2i-1}(\hat{u}_1)\}. \tag{50}$$

Using the fact that $|\hat{u}_1| < B$, $|\hat{u}_2| < B$ and the upper bound

$$|g_1| \le \sup_{u \in (0,B]} \rho'_{2i-1}(u) = \gamma_i \le \gamma_1,$$

the derivative (50) is upper bounded by

$$-2a_1^T\epsilon + 2\hat{u}_1 - 2(1 - 2\sin^2(\alpha))\hat{u}_2 + \lambda g_1 \le -2a_1^T\epsilon + 4B + \lambda\gamma_1.$$

Under the assumption that $\lambda\gamma_1 < 2a_1^T\epsilon - 4B$, the above inequality implies $0 \notin \partial f(\hat{u}_1)$, contradicting
the fact that $0 \in \partial f(\hat{u}_1)$.
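
The $4B$ term in the upper bound on the derivative (50) comes from bounding the two linear terms separately. A short worked version of that step, assuming only that both coordinates of the stationary point are smaller than $B$ in absolute value and that $|1 - 2\sin^2(\alpha)| \le 1$:

```latex
\begin{align*}
2\hat{u}_1 - 2\bigl(1 - 2\sin^2(\alpha)\bigr)\hat{u}_2
  &\le 2|\hat{u}_1| + 2\bigl|1 - 2\sin^2(\alpha)\bigr|\,|\hat{u}_2| \\
  &< 2B + 2B = 4B .
\end{align*}
% Together with \lambda g_1 \le \lambda \gamma_1 (for \lambda \ge 0), every element
% of the generalized derivative is strictly below -2 a_1^T \epsilon + 4B + \lambda \gamma_1.
```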






C.3 Proof of part (c) 

Recall that the 2-vector $\theta^0_{1:2}$ follows a $N(0, b^2 I_{2\times 2})$ distribution, so that

$$P[\mathcal{E}_0] = P[\max\{\theta_1^0, \theta_2^0\} \le 0] = \frac{1}{4},$$

which establishes the first claim of part (c). To prove the second statement, we notice that the
inequality $2\sin^2(\alpha)\, r + 2\|w_{1:2}\|_2/\sqrt{n} + 3B < 2a_1^T w_{2i-1:2i}/\sqrt{n} - 4B$ can be written in the equivalent form

$$2a_1^T w_{2i-1:2i} - 2\|w_{1:2}\|_2 > [\ldots]. \tag{51}$$

The inequality (51) holds if $a_1^T w_{2i-1:2i}/\sigma \ge 1$ and $\|w_{1:2}\|_2^2/\sigma^2 \le 1/64$. In fact, $a_1^T w_{2i-1:2i}/\sigma$ is
distributed as $N(0, 1)$, and $\|w_{1:2}\|_2^2/\sigma^2$ follows a chi-square distribution with two degrees of freedom.
The two events are independent, and each of them happens with (positive) constant probability.
Thus, there is a numerical constant $c > 0$ such that inequality (51) holds with probability at least $c$.
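
The constant probabilities in this argument have closed forms. The sketch below evaluates them with the Python standard library, assuming the two thresholds read off the text above (a standard normal exceeding 1, and a chi-square variable with two degrees of freedom falling below 1/64); the variable names are illustrative, and the product is only one admissible choice of the constant $c$.

```python
import math

# Two independent zero-mean Gaussian coordinates are both non-positive
# with probability (1/2) * (1/2).
p_init = 0.5 * 0.5

# P[Z >= 1] for Z ~ N(0, 1), via the complementary error function.
p_normal = 0.5 * math.erfc(1.0 / math.sqrt(2.0))

# P[Q <= 1/64] for Q ~ chi-square with 2 degrees of freedom,
# whose CDF is 1 - exp(-x / 2).
p_chi2 = 1.0 - math.exp(-(1.0 / 64.0) / 2.0)

# The two events are independent, so their intersection has the product probability.
c_lower = p_normal * p_chi2

print(p_init, p_normal, p_chi2, c_lower)
```

Both factors are strictly positive and do not depend on $n$, which is all the argument requires.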